Proceedings of the Seminar Hardware/Software Codesign

Lecturer:
Jun.-Prof. Dr. Christian Plessl

Participants:
Erik Bonner
Wei Cao
Denis Dridger
Christoph Kleineweber
Sandeep Korrapati
André Koza
Pavithra Rajendran
Maryam Sanati
Gavin Vaz

WS 2011/12
University of Paderborn
Contents

1 An Introduction to Automatic Memory Partitioning
  Erik Bonner
2 Error Detection Technique and its Optimization for Real-Time Embedded Systems
  Wei Cao
3 CPU vs. GPU: Which One Will Come Out on Top? Why There is no Simple Answer
  Denis Dridger
4 Will Dark Silicon Limit Multicore Scaling?
  Christoph Kleineweber
5 Guiding Computation Accelerators to Performance Optimization Dynamically
  Sandeep Korrapati
6 A Case for Lifetime-Aware Task Mapping in Embedded Chip Multiprocessors
  André Koza
7 Warp processing
  Maryam Sanati
8 Performance Modeling of Embedded Applications with Zero Architectural Knowledge
  Pavithra Rajendran
9 Improving Application Launch Times
  Gavin Vaz
An Introduction to Automatic Memory Partitioning

Erik Bonner
University of Paderborn
berik@mail.uni-paderborn.de

January 12, 2012
Abstract

This paper presents Automatic Memory Partitioning, a method for automatically increasing a program's data parallelism by splitting its data structures into segments and assigning them to separate, simultaneously accessible memory banks. Unlike other data optimization methods, Automatic Memory Partitioning uses dynamic analysis to identify partitionable memory. After partitioning, the set of partitioned memory regions is assigned to a set of available memory banks by solving a budgeted graph colouring problem by means of Integer Linear Programming (ILP). After introducing Automatic Memory Partitioning, this paper offers a discussion of its merits and pitfalls.
1 Introduction

Field Programmable Gate Arrays (FPGAs) and other embedded systems can organize memory into multiple memory banks, which can be accessed simultaneously. Since many applications are memory-bound, organizing memory into separate banks such that data parallelism is increased during execution can be a powerful means of improving program performance. Consider, for example, the code in Listing 1. If all memory were organized in a single memory bank, and the latency of a read were a single clock cycle, 3 clock cycles would be necessary to access the memory required to compute the result sum. If, however, each of the arrays a, b and c were stored in a separate memory bank, the necessary data could be obtained in a single clock cycle.
for (int i = 0; i < ARRAY_SIZE; i++)
    sum[i] = a[i] + b[i] + c[i];

Listing 1: An example of code that benefits from memory parallelization. Listing source: [3].
Arrays of data structures are often traversed linearly and, in each iteration, several components are accessed in the same basic block. An example is given in Listing 2. Two for-loops traverse an array of structs of type point3d, accessing all three fields x, y and z in each iteration. When serviced by a single memory bank, extracting the contents of each point3d object will have a latency of 3 cycles. However, if the contents of each point3d object can be distributed across several different memory banks, each position extraction can be performed in a single clock cycle.
void find_starship(point *stars, int n, point3d *ship,
                   int m, int *avail)
{
    int sx = 0, sy = 0, sz = 0, b = 0;

    // find galaxy center
    for (int i = 0; i <
Figure 1: An example of the memory used in Listing 1 partitioned and distributed across 3 memory banks. Figure inspired by a similar diagram in [3].

Automatic Memory Partitioning identifies separately accessed memory regions and, by solving a budgeted graph-colouring problem using Integer Linear Programming, assigns (partitions) them to a minimal set of memory banks (see Section 3.2). The particular focus of this technique is the splitting of complex data structures into their constituent fields and the assignment of those fields to different memory banks, thus greatly accelerating code similar to the example given in Listing 2.
After the introduction to the problem addressed by Automatic Memory Partitioning given in this section, Section 2 discusses current approaches to memory parallelization in the literature. Section 3 then discusses the Automatic Memory Partitioning method in detail. The evaluation results reported by the technique's authors, Asher and Rothem [3], are given in Section 4. Finally, a critical discussion is provided in Section 5, before the paper is concluded in Section 6.
2 Related Work

A number of memory reshaping and partitioning techniques have been proposed in the literature for improving application performance. The majority of these are based on static analysis of application source code, i.e., analysis that can be performed at compile time.
Zhao et al. [9] proposed Forma, an automatic data reshaping technique performed (transparently to the programmer) at compile time. The aim of Forma is to reshape arrays of structs in order to improve data locality and thereby optimize cache usage. For example, consider the code given in Listing 3. The for-loop on lines 3-6 traverses an array of point3d objects, accessing only the .x field in each iteration. If left unmodified, the code in Listing 3 exhibits poor cache performance: although only the .x field is used in each iteration, the .y and .z fields of each structure element will, due to their proximity in memory to the .x field, also be fetched into the cache, causing significant cache clutter.
1 // compute average x coordinate
2 int sumx = 0, avg = 0;
3 for (int i = 0; i < NUM_STARS; i++)
4 {
5     sumx += stars[i].x;
6 }
7 avg = sumx / NUM_STARS;

Listing 3: Code suitable for optimization with Forma.
By combining statistics gathered from execution profiling, which identify the usage frequency and affinity of the structure fields, with static code analysis, the data structure fields in Listing 3 can be partitioned and the stars array reshaped to match the data locality present in the program execution. Figure 2 shows how the stars array could be reshaped to improve cache performance. Although Forma is primarily targeted at devices with traditional memory hierarchies, its data structure partitioning and array reshaping can be adapted to target platforms with multiple memory banks.
Figure 2: An example of reshaping an array of point3d objects using Forma. Using the restructured array, the traversal in Listing 3 would enjoy significantly improved cache performance.
Lattner and Adve [7] proposed a technique called Automatic Pool Allocation, which, by means of static pointer analysis, improves the performance of heap-based data structures (such as linked lists or trees) by partitioning the allocation of individual complex objects into different memory pools. For example, the nodes of a linked list can automatically be allocated in a dedicated memory pool. By controlling the allocation of objects within pools, the compiler can ensure that memory is structured in an aligned format, which greatly improves data locality. Figure 3 compares the memory structure of a linked list allocated using traditional allocators (such as malloc()) with one whose nodes have
been allocated using Automatic Pool Allocation. Since the linked-list nodes are not scattered throughout memory in the latter case, a traversal of the linked list will benefit from improved cache performance.
(a) (b)

Figure 3: An example of Automatic Pool Allocation. The figure on the left (a) shows a set of nodes belonging to a linked list, allocated in main memory using traditional methods. The nodes are scattered randomly throughout memory. The figure on the right (b) shows the same nodes allocated using Automatic Pool Allocation. Using this method, new nodes are allocated in so-called "pools", which are dedicated memory regions ensuring contiguous node allocation.
Like Forma, Automatic Pool Allocation is a static technique primarily intended for use during compilation for traditional, hierarchy-based memory architectures. However, also like Forma, Automatic Pool Allocation can be readily adapted to architectures using multiple memory banks.
Curial et al. [5] proposed a method called MPADS (Memory-Pooling-Assisted Data Splitting), which can be considered a combination of Forma and Automatic Pool Allocation. Using this method, individual objects of complex data structure types are split among memory pools. In this respect, MPADS offers functionality very similar to the Automatic Memory Partitioning technique described in this paper. Unlike Automatic Memory Partitioning, however, MPADS accomplishes its memory splitting and allocation purely through static code analysis, which, its authors argue, has the advantage of avoiding the generation of large memory traces. On the other hand, MPADS is designed for use with commercial compilers, and therefore must be more minimalistic and pessimistic in its approach than other, research-specific methods. For example, if there is a chance that a potential memory transformation could modify the semantics of the target program, the transformation is abandoned.
The main contribution of Automatic Memory Partitioning, which is not addressed by the related work, is the combination of data structure partitioning with dynamic code analysis. This entails analysing the program according to its dynamic behaviour, rather than
analysing its code statically at compile time. The pros and cons of this approach are discussed in Section 5.
3 Proposed Technique

Automatic Memory Partitioning is a technique for optimizing linear traversals of data structure arrays on embedded devices (primarily FPGAs) that organize memory into a set of simultaneously accessible memory banks. By automatically partitioning program data structures such that individual structure components are placed in different memory banks, linear traversals of data structure arrays are significantly accelerated (see the example in Section 1).

Automatic Memory Partitioning consists of two main stages: identifying the set of disjoint memory access patterns within a program/kernel execution, and assigning memory regions to a minimal set of memory banks. These stages are described in Sections 3.1 and 3.2, respectively. Once memory has been redistributed into banks, all pointers accessing this memory must be updated. This process is described in Section 3.3.
3.1 Linear Memory Pattern decomposition

3.1.1 Linear Memory Patterns (LMPs)

The first step in the proposed method is to decompose the overall memory signature of a program execution into a set {lp0, ..., lpk} of disjoint Linear Memory Patterns (LMPs), where:

• Each load in the code is associated with an LMP lpi.
• Each LMP lpi represents a set of sequentially spaced memory addresses of the form αx + β, where β is the offset of the first memory access, α is the stride separating adjacent accesses, and x is an integer between 0 and some upper bound n.
• Each memory operation in the program is mapped to exactly one LMP, which spans all memory addresses associated with that operation's signature.
3.1.2 Memory profiling

Unlike the memory partitioning methods discussed in Section 2, the set of LMPs existing in a program's memory signature is identified by means of dynamic program analysis. To obtain the memory trace of an execution, the program source code is instrumented such that a call to a custom function is inserted immediately prior to each memory opcode. When the instrumented binary is executed, these custom functions write the identifier and operand address(es) of each memory operation to a log on disk. After execution, the contents of the log make up a complete memory trace of the program execution. Figure 4 shows a portion of a sample memory trace log.
Figure 4: An example memory trace log. Image source: [3].

The example trace in Figure 4 logs four opcodes (referred to as #7, #12, #17 and #22) consecutively operating on the fields of an array of adjacently allocated data structure objects. The address on which each opcode operates is given in the left-most table column, and the basic block to which it belongs is specified in the right-most column.
3.1.3 Data structure decomposition

Once the memory trace of an execution has been generated, it is analysed to determine a set of LMPs that can correctly represent the program's memory profile. For this analysis, an LMP is defined as a 4-tuple (Rl, Rh, Op, S), where Rl and Rh define the lower and upper bounds of the memory range, respectively; Op defines the set of memory operations that operate on addresses within this range; and S, which corresponds to α in Section 3.1.1, defines the stride between potential accesses.

Listing 4 shows an example code snippet that loops through an array of point3d objects and accesses the .x structure field. Since the array of structs is allocated as a contiguous memory region of adjacent struct elements, each access to the .x field is separated from the next by a distance of sizeof(point3d) bytes. Furthermore, since the memory operation applied to this field alternates between reading and writing, the LMP property Op contains both read and write opcodes. Finally, the memory range defined by Rl and Rh spans 100*sizeof(point3d) bytes. The diagram in Figure 5 visualizes the LMP, denoted lp0, constructed from the code in Listing 4.
point3d parray[100];
for (int i = 0; i < 100; i++)
{
    if (i % 2 == 0)
        do_some_computation(parray[i].x);
    else
        parray[i].x = some_other_computation();
}

Listing 4: Simple code for looping through an array of structs, alternating between reading and writing.
Figure 5: A view of memory during the execution of the code in Listing 4. An LMP, lp0, can be constructed to represent the accesses to the field parray[i].x (marked in yellow). The LMP range, Rl and Rh; set of operations, Op; and stride, S, which in this case is equal to sizeof(point3d), are marked in the diagram.
The pseudocode given in Figure 6 demonstrates how the set of LMPs for a given program execution can be extracted from its memory trace. The first loop, on Lines 1 to 11, creates an LMP for each opcode in the set of all opcodes found in the memory trace. Note that this part of the algorithm can be performed online, while the memory trace is being generated. The second loop compares each identified LMP with all other identified LMPs to determine whether any two can be merged. Two LMPs can be merged if they operate on common memory cells, which is the case if both of the following conditions hold:

1. The candidate ranges intersect.

2. Both candidates have the same offset within their stride.
When traversing an array of complex data types, the traversal stride represents the size of the complex data type object, and the offset within the stride indicates which field within the data structure is being accessed. For example, consider two functions, compute_x() and compute_y(). The body of compute_x() is made up of the code given in Listing 4, while the body of compute_y() is nearly identical, except that it operates on the parray[i].y field. The LMPs extracted from the traces of these functions would have identical strides and largely overlapping ranges. However, since they access different fields of the point3d data structure, the offsets within their strides differ. Therefore, the LMPs of the compute_x() and compute_y() functions are not mergeable.
Two candidate LMPs are merged by setting the merged range bounds to the minimum and maximum of their respective lower and upper range bounds, setting the merged Op field to the union of both candidates' Op sets, and setting the merged stride to the greatest common divisor of the two candidate strides. After the second, nested loop (Lines 12-23), the set of disjoint LMPs present in the program execution has been identified and is ready for assignment to the available memory banks.
Figure 6: The algorithm used for extracting a set of LMPs from a memory trace. Image source: [3].
3.2 Memory bank allocation

Once the set of LMPs present during execution has been identified, the memory referenced by each LMP must be assigned to memory banks in an optimal manner. To accomplish this, the set of LMPs must be assigned to a set of K memory banks with known capacities, such that:

• Maximum memory parallelism is achieved.

• The capacity of each memory bank is sufficient to store all LMPs assigned to it.

• A minimal number of banks is used.
The optimal assignment of LMPs to memory banks is attained by solving a modified graph colouring problem. The traditional graph colouring problem is formulated as follows. Given a graph G = (V, E), where V is a set of vertices and E is the set of edges connecting them, a mapping c : V → C is sought such that ∀(u, v) ∈ E, c(u) ≠ c(v), where the function c() assigns a "colour" from the set C to each vertex. In other words, given a graph, the graph colouring problem involves assigning a set of colours (or, generally, some values) to the graph vertices such that no adjacent vertices are assigned the same colour. For the assignment of LMPs to memory banks, the LMPs are the graph vertices and the memory banks are the assignable colours. Two vertices are connected by an edge if their LMPs cannot be assigned to the same memory bank. Furthermore, an additional constraint is added to the problem: each LMP, or vertex, has an associated size, and each bank, or colour, has a limited capacity. LMPs must be assigned to banks such that no bank has its capacity exceeded. This is known as a budgeted graph colouring problem. Figure 7 shows a simple example of the budgeted graph colouring problem, solved for a set of 5 nodes and 3 colours.
Figure 7: An example of a solved budgeted graph colouring problem. Each node has an associated size value and each colour has an associated capacity. Nodes must be assigned to colours such that the total size of all nodes assigned to a given colour does not exceed that colour's capacity. Figure redrawn from [3].
In general, the graph colouring problem is NP-complete [6]. A common problem for which graph colouring is used is the assignment of variables to registers in compilers [4]. Accordingly, a number of heuristic-based solution strategies have been proposed. In Automatic Memory Partitioning, the memory bank allocation problem is solved using Integer Linear Programming (ILP). Budgeted graph colouring is structured as an ILP problem as follows. For n LMPs and m memory banks, a set of m×n boolean variables is defined such that the variable xij is 1 if LMP i is assigned to memory bank j. Furthermore, for each memory bank, a boolean variable cj indicates whether that memory bank is currently being used. By minimizing c0 + ... + cm subject to a number of constraints, an optimal bank allocation can be found. The constraints are defined as:

• Each LMP is assigned to exactly one memory bank:
  ∀i: Σj xij ≥ 1 and Σj xij ≤ 1 (i.e., Σj xij = 1)
• No memory bank is overfilled:
  ∀j: Σi xij · sizeof(LMPi) ≤ sizeof(bankj)

• Conflicting LMPs cannot be assigned to the same bank:
  ∀j: xvj + xwj ≤ 1, where v and w are conflicting LMPs.
The above ILP problem is solved using the freeware CVXOPT software package.
3.3 Pointer synthesis

Once memory has been rearranged into a minimal set of memory banks, all pointers in the target program accessing this memory must be reassigned accordingly. Consider the memory bank depicted in Figure 8, which contains three LMPs. In the original memory, each LMP has an associated starting address (Rl), size (Rh − Rl) and stride (S). When assigned to a memory bank, these LMP properties must be updated such that memory is correctly addressed within the assigned bank.

Figure 8: A single memory bank with three LMPs (lpi, lpj and lpk) assigned to it. Each LMP has an associated size and offset within the bank.
For each pointer Pold that accesses the LMP in the original memory, the following steps are taken to determine its new value Pnew within the assigned memory bank. First, the start address Rl is subtracted from Pold. Then, since the memory accessed by each LMP will be packed linearly into the assigned memory bank, the LMP stride must be adjusted; this is accomplished by scaling each old pointer value by a factor ŝ, where ŝ is a multiple of its LMP stride. Finally, the starting address b̂ of the LMP within its newly assigned memory bank must be added. The complete pointer mapping is given by:
Pnew = (Pold − Rl)/ŝ + b̂ = Pold/ŝ − Rl/ŝ + b̂ = Pold/ŝ ± C

where C is a constant for each LMP. The most expensive part of this mapping is the division Pold/ŝ. However, when ŝ is a power of two, this can be implemented using bit-shifting, which is a cheap operation on FPGAs.
4 Reported Results

Automatic Memory Partitioning performance was evaluated by synthesising a collection of memory-intensive programs from the NVIDIA CUDA SDK [8], the CLAPACK SDK [1] and the SystemRacer test suite [2]. The samples were synthesized with single, as well as
(a) (b)

Figure 9: Evaluation. The table on the left (a) lists the name of each test program (left column), the number of cycles per iteration when using a single memory bank (center-left column) and multiple memory banks (center-right column), as well as the number of memory banks used for Automatic Memory Partitioning (right column). These results are visualized in the graph on the right (b). Images from [3].
multiple memory banks, and the resulting performances were compared. All programs were synthesized to Verilog using the SystemRacer synthesis engine. Each memory bank was synthesized with a single memory port, and each memory port had a latency of 3 cycles. A comparison of the performance measured for the test programs synthesized with a single vs. multiple memory banks is given in Figure 9.
In most cases, it was possible to synthesize the target code using more than one memory bank. In all such cases, performance improvements were recorded when running the multiple-bank versions. As can be expected, the more banks used, the greater the memory parallelism, and hence the greater the performance gains.
5 Discussion

This section provides additional discussion and remarks regarding the Automatic Memory Partitioning method.
In the original paper by Ben-Asher and Rotem [3] it is claimed that, unlike previously existing methods, Automatic Memory Partitioning performs memory optimization by means of dynamic analysis. Although this is true, there are some significant limitations. A target application's memory is partitioned based on an analysis of its memory trace, generated during a profiling run. For the method to work, it is necessary that memory addresses and usage are identical between runs. For many applications, particularly those whose control flow is data-dependent, this means that the memory partitioning will only work on the exact input for which the memory trace was generated. Furthermore, to ensure that memory will be located in the same place between runs, the method relies on custom memory allocators, rather than traditional functions such as malloc(), whose allocation addresses can vary between runs (for instance because of address-space layout randomization introduced for security reasons). Since such allocators allocate memory in a predefined, predictable manner that is persistent between runs, a program using them can also be correctly analysed using static analysis. This weakens the claim that, by using dynamic analysis techniques, Automatic Memory Partitioning achieves results that are not obtainable using static methods.
Another point of discussion is the reported results. As discussed in Section 4, results were gathered by synthesizing a collection of sample programs with a single memory bank, and comparing performance with the same programs synthesized with multiple memory banks. Unsurprisingly, the programs synthesized with multiple memory banks outperformed those with a single memory bank. This is more a proof that the method works than that it works well. Far more interesting would have been a comparison between sample programs optimized with Automatic Memory Partitioning and those optimized using other methods in the literature, such as MPADS. Furthermore, a number of the samples taken from the CUDA SDK are already hand-optimized to use multiple (shared) memory banks. Synthesizing these to use a single memory bank would involve significant modifications to the original source code, with the explicit goal of reducing performance. When synthesized for use with multiple memory banks, did they use the modified, single-bank code, or the original SDK sample, written with a multiple memory bank architecture in mind? In the paper, this is not clear.
In addition to the performance of the synthesized application, the performance of the Automatic Memory Partitioning procedure itself is also of interest. Discussion of this is largely left out of the original paper. Both major phases of Automatic Memory Partitioning (the memory partitioning itself, and the assignment of memory regions to available memory banks) can potentially be slow under certain circumstances. The partitioning of data structures relies on execution traces, which can become very large, particularly for applications that process large amounts of data and contain frequent data-dependent branching. The authors of the MPADS method (described in Section 2) explicitly state the importance of avoiding execution traces when performance is a concern [5]. Furthermore, when the number of identified LMPs becomes large, the task of assigning memory banks becomes increasingly complex. In Automatic Memory Partitioning, this task is formulated as an ILP problem and solved using a heuristic solver. The authors reported running times of under a second for a set of 10 LMPs. It would be interesting to see the performance for larger LMP sets, and to know how many LMPs can be expected when synthesizing larger programs.
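The bank-assignment problem closely resembles graph colouring (compare the colouring references [4, 6]). The paper's actual ILP formulation is not reproduced here; as a purely illustrative sketch, a greedy largest-degree-first colouring assigns interfering LMPs (those that must be accessible in the same cycle) to distinct banks while letting non-interfering LMPs share a bank. All names and the interference model are assumptions for illustration:

```python
def assign_banks(lmps, interferes, max_banks):
    """Greedy bank assignment: LMPs that interfere must land in different
    banks so they can be accessed in parallel; non-interfering LMPs may
    share a bank, keeping the bank count low.

    lmps:       iterable of LMP identifiers
    interferes: dict mapping an LMP to the set of LMPs it interferes with
    Returns {lmp: bank} or None if more than max_banks would be needed.
    """
    banks = {}
    # Classic heuristic: colour the most-constrained (highest-degree) LMPs first.
    for lmp in sorted(lmps, key=lambda l: len(interferes.get(l, ())), reverse=True):
        used = {banks[o] for o in interferes.get(lmp, ()) if o in banks}
        bank = next(b for b in range(max_banks + 1) if b not in used)
        if bank >= max_banks:
            return None            # the heuristic could not fit within max_banks
        banks[lmp] = bank
    return banks
```

Three mutually interfering LMPs need three banks, while two non-interfering LMPs happily share bank 0.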
One advantage of using memory traces as the sole basis for memory analysis is the relative simplicity of the method. Static techniques often need to employ complex, language-dependent pointer analysis, with additional measures for type-unsafe languages such as C and C++. By analysing the memory trace rather than the code itself, these complex methods can be avoided. Moreover, using memory traces makes the memory analysis largely language-independent; Automatic Memory Partitioning can easily be used for any language that can be instrumented to generate suitable memory trace logs. On the other hand, the generation and analysis of memory traces can be a cumbersome process, since the traces can become very large.
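The core of such a trace-based analysis can be sketched in a few lines. Assuming a hypothetical trace format of (access site, address) pairs produced by an instrumented profiling run, the sketch reports which pairs of access sites never touch a common address and could therefore be served from separate memory banks:

```python
from collections import defaultdict

def disjoint_access_sites(trace):
    """Given a memory trace of (site_id, address) pairs, return the pairs
    of access sites whose address sets never overlap.  Such sites do not
    interfere and are candidates for placement in separate memory banks."""
    touched = defaultdict(set)
    for site, addr in trace:
        touched[site].add(addr)          # record every address a site accesses
    sites = sorted(touched)
    return {(a, b)
            for i, a in enumerate(sites)
            for b in sites[i + 1:]
            if touched[a].isdisjoint(touched[b])}
```

A loop reading two separate arrays yields a disjoint pair; two sites that ever alias yield none. The sketch also makes the scalability concern concrete: the `touched` sets grow with the trace, which is exactly why large traces become cumbersome.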
6 Conclusion

This paper introduced a technique for automatically partitioning data structures across multiple memory banks on embedded devices such as FPGAs, which enhances application performance by increasing memory parallelism.
After using a number of simple examples in Section 1 to illustrate the advantages of memory partitioning on architectures with simultaneously accessible memory banks, a number of relevant data partitioning methods from the literature were discussed in Section 2. Although several of the existing methods show promising results, all rely on static code analysis to identify memory partitioning opportunities. Following the literature review, Section 3 introduced a memory optimization technique that uses dynamic analysis: Automatic Memory Partitioning. Automatic Memory Partitioning identifies a target program's memory access patterns by analysing its memory trace. Once a set of non-interfering memory access patterns has been identified, the patterns are assigned to a set of memory banks, taking care to minimize the number of banks used while maximizing data parallelism. The results reported by the authors of the technique were given in Section 4. Finally, Section 5 offered a critical discussion of the Automatic Memory Partitioning technique, evaluating its strengths and weaknesses.
References

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[2] Y. Ben-Asher and N. Rotem. Synthesis for variable pipelined function units. In International Symposium on System-on-Chip (SOC 2008), pages 1-4, November 2008.

[3] Yosi Ben-Asher and Nadav Rotem. Automatic memory partitioning: increasing memory parallelism via data structure partitioning. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS '10), pages 155-162, New York, NY, USA, 2010. ACM.

[4] G. J. Chaitin. Register allocation & spilling via graph coloring. SIGPLAN Notices, 17:98-101, June 1982.

[5] Stephen Curial, Peng Zhao, José Nelson Amaral, Yaoqing Gao, Shimin Cui, Raúl Silvera, and Roch Archambault. MPADS: memory-pooling-assisted data splitting. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08), pages 101-110, New York, NY, USA, 2008. ACM.

[6] M. R. Garey and D. S. Johnson. The complexity of near-optimal graph coloring. Journal of the ACM, 23:43-49, January 1976.

[7] Chris Lattner and Vikram Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05), Chicago, Illinois, June 2005.

[8] NVIDIA. NVIDIA CUDA SDK, 2011.

[9] Peng Zhao, Shimin Cui, Yaoqing Gao, Raúl Silvera, and José Nelson Amaral. Forma: a framework for safe automatic array reshaping. ACM Transactions on Programming Languages and Systems, 30, November 2007.
Error Detection Technique and its Optimization for Real-Time Embedded Systems

Wei Cao
University of Paderborn
wcao@mail.upb.de

January 12, 2012
Abstract

This paper discusses error detection techniques and the optimization of the error detection implementation (EDI) in the context of different FPGAs, including FPGAs with static configuration and FPGAs with partial dynamic reconfiguration (PDR). In these error detection techniques, path tracking and variable checking are the main sources of performance overhead. Depending on how the two are implemented, there are three basic error detection implementations: the software-only (SW-only) approach, in which both path tracking and variable checking are implemented in software; the mixed software/hardware (mixed SW/HW) approach, in which path tracking, which causes significant time overhead, is moved into hardware while variable checking remains in software; and the hardware-only (HW-only) approach, in which both are performed in hardware. This paper introduces error detection approaches based on these basic implementations and discusses them in detail. Furthermore, since an application normally consists of a number of processes, error detection can be optimized by applying it to every process individually, i.e. an efficient overall implementation is achieved through the refinement of the error detection. Therefore, two optimization algorithms are presented as well: one for FPGAs supporting only static configuration, and one for FPGAs supporting PDR. The improvement achieved by the optimization is shown through experimental results.
1 Introduction

Errors are unavoidable in any system. If they are not detected in time, they can cause deviating results or even program crashes. Detecting errors is the only way to guarantee the validity of an application's execution. Error detection is therefore indispensable for any system, and especially for real-time systems, in which errors must be detected not only effectively but also efficiently. To achieve this goal, many error detection techniques have been developed. Each technique either causes a certain time overhead or requires a certain amount of hardware resources; some techniques incur both. In real-time systems, each application has a deadline. Because of this deadline, time overhead is a more important factor for error detection in real-time systems than hardware cost. Consequently, the time for error detection should be minimized in order to satisfy the deadline of the application. There are various ways to optimize this time; determining an appropriate error detection implementation for each process of the application in an intelligent manner is a particularly promising one.

The main focus of this paper is a systematic discussion of the error detection technique, including the corresponding approaches, and the explanation of an approach to the optimization of the error detection implementation.
2 Error Detection Technique

Although the traditional, so-called "one-size-fits-all" approach to error detection is capable of providing a certain error coverage, this coverage can be rather low and may not meet the expected requirements; i.e. the traditional approach is not able to supply sufficient reliability. Since every application has its own characteristics, the reliability provided by error detection can be dramatically improved if the EDI for a specific application is adjusted according to these characteristics. To take full advantage of the characteristics of each application, the application-aware technique has been developed.
2.1 Working Principle

The purpose of the application-aware technique is to improve the reliability of an application with the help of its characteristics, as stated above. This raises the next question: how is error detection implemented in the application-aware technique? The answer is as follows:

1. The first step is to identify critical variables in a program. A critical variable is defined as "a program variable that exhibits high sensitivity to random data errors in the application" [6].

2. Once critical variables have been identified, the backward program slice, defined as "the set of all program statements/instructions that can affect the value of the variable at a program location" [8], is extracted as the second step.

3. After the extraction of the backward program slice, checking expressions are generated during the optimization of each slice at compile time. These expressions are then inserted into the original code and are chosen by checking instructions to compare the results.
Thus, along with the execution of the original code, instructions for tracking control paths and the checking expressions are utilized to implement error detection. The above three steps briefly introduce the principle of the application-aware technique; more details are explained in Section 2.3.1.
2.2 Error Detection Implementations

In this paper, only transient faults are considered. Path tracking and variable checking can each be implemented either in software, potentially resulting in high time overheads, or in hardware, possibly exceeding the amount of available hardware resources. Based on the different implementation combinations of path tracking and variable checking, there are three types of error detection implementations:

• SW-only: In the SW-only implementation, both path tracking and variable checking are implemented in software. Compared with variable checking, path tracking causes significant time overhead when implemented in software. Hence, the time overhead of the SW-only implementation is the largest among all the error detection implementations. On the other hand, because all error detection is implemented in software, almost no hardware resources are needed.

• HW-only: In the HW-only implementation, both path tracking and variable checking are performed in hardware. Thus, the time overhead decreases considerably. But the disadvantage of the hardware implementation is equally obvious: a large amount of hardware is required, sometimes even beyond the amount of available hardware resources.

• Mixed SW/HW: Since path tracking causes significant time overhead, moving it into hardware is a natural way to reduce the overall overhead drastically. After this move, path tracking is performed in parallel with the execution of the application, so that plenty of time can be saved. The checking expressions for the critical variables remain in software, so the hardware requirement of the mixed SW/HW implementation is lower than that of the HW-only implementation. To some degree, the mixed SW/HW implementation can be regarded as a composition absorbing the advantages of both the SW-only and the HW-only implementation.

These basic error detection implementations are the foundation on which the error detection approaches (see Section 2.3) and the optimization of the error detection implementation (see Section 3) are realized.
2.3 Error Detection Approaches

In this section, two extreme error detection approaches are discussed: the complete SW-only approach and the complete HW-only approach. In the complete approaches, all error detection is implemented in software or performed in hardware, respectively. Given that the principle of path tracking in the mixed SW/HW approach is similar to the one in the complete HW-only approach, and likewise the principle of variable checking in the mixed SW/HW approach is similar to the one in the complete SW-only approach, the mixed SW/HW approach is not discussed separately here.
2.3.1 Complete SW-Only Approach

An approach to deriving error detectors using static analysis [1] of an application is presented in [6]. A detector is defined as "the set of all checking expressions for a critical variable, one for each acyclic, intraprocedural control path in the program" [6]. The main steps of deriving error detectors are as follows:

1. Identify critical variables in the program. Critical variables are the program variables with the highest fan-outs (defined as the number of forward dependencies). These variables are of prime importance, as errors in them can propagate to many locations in the program and result in program failure. If these variables can be protected, a larger error coverage can be achieved. The approach for identifying critical variables can be found in [5].

2. Compute the backward program slice of the critical variables. Starting with the instruction that computes the value of a critical variable, the static dependence graph of the program is traversed backwards to the beginning of the function. The backward program slice is specialized for each acyclic control path and consists of the instructions that can legally modify the critical variable.

3. Generate checking expressions through the optimization of the backward slice of the critical variables. These checking expressions are inserted into the program immediately after the computation of the critical variable. In order to choose the corresponding checking expression for each control path, the program is instrumented with tracking instructions that track the control paths.

4. Check at runtime. At runtime, the corresponding checks are performed at the appropriate points, while each control path is tracked. When the checks are executed, they recompute the value of the critical variable and compare it with the value computed by the original program. If the values do not match, the original program stops and initiates the recovery.
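The runtime check of step 4 can be illustrated with a toy sketch: a critical variable x is computed along one of two control paths, and the inserted detector recomputes it from the tracked path and compares. The program, the variable names and the per-path checking expressions are invented for illustration; in the real technique the expressions are derived from the backward slice at compile time.

```python
def compute_with_detector(s, t, w, take_branch):
    """Original computation of the critical variable x, followed by the
    inserted CVR-style check for the tracked control path."""
    # --- original program: x is the critical variable, path is tracked ---
    if take_branch:
        path, x = 1, w            # control path 1
    else:
        path, x = 2, s - 2 * t    # control path 2
    # --- inserted detector: recompute x along the tracked path ---
    x_check = w if path == 1 else s - 2 * t
    if x_check != x:
        # mismatch: a transient fault corrupted x or its slice
        raise RuntimeError("error detected, initiating recovery")
    return x
```

In a fault-free run both values match and the program continues; a fault that corrupts x (or any instruction in its slice) makes the comparison fail and triggers recovery.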
2.3.2 Complete HW-Only Approach

The technique mentioned in Section 2.3.1 is called the Critical Variable Recomputation (CVR) technique. While Section 2.3.1 described the complete software implementation of CVR, the approach explained in this section, introduced in [4], is its hardware implementation. The core part of this approach is the Static Detector Module (SDM), which consists of a path tracking submodule, a checking submodule and, if necessary, an argument buffer called ARGQ, as shown in Figure 1.
Figure 1: Static Detector Module [4]. (Block diagram: the Leon3 core issues the CHK instructions check, emitEdge, enterFunc and leaveFunc, together with their arguments, to the SDM, which contains the Checking submodule, the Path Tracking submodule, the StateStack and the ARGQ buffer.)

The path tracking submodule tracks the control path and indicates which instruction
is currently being executed, in order to indicate which operations should be recomputed next. This submodule consists of hardware state machines and a stack structure, the StateStack. Each state machine corresponds to a particular check and is constantly updated during program execution. For each state machine, a corresponding stack is set up in the StateStack; the StateStack is thus a set of individual stacks. The benefit of this structure is that the overhead of accessing the StateStack is minimized, because each stack can be accessed in parallel with the others. Three types of CHK instructions, which are viewed as analogous to no-operation instructions, are recognized by the path tracking submodule:
• emitEdge(src,dest): This instruction is issued at branches during program execution. Both of its arguments, src and dest, are inserted into the ARGQ buffer, and according to these arguments the state machines for path tracking are updated.

• enterFunc: This instruction is issued when the program enters a function. In this case, the current states of the state machines are pushed onto the StateStack.

• leaveFunc: Corresponding to enterFunc, leaveFunc is issued when the program leaves a function. In this case, the states stored on the StateStack are popped, and the state machines are thereby restored to their previous states.
The Checking submodule is responsible for recomputation in parallel with program execution and for determining when to recompute. In contrast to the path tracking submodule, only one type of CHK instruction is recognized by the checking submodule:
• check(num): This instruction is issued when a check needs to be performed. The argument num indicates the ID of the check. As shown in Figure 1, the checking submodule receives the output of the path tracking submodule and, with the help of this output, executes the appropriate check.
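The interplay of the three path-tracking instructions with the StateStack can be sketched as a small software model. The transition table, the resetting of the state machines on function entry, and all names are assumptions made for illustration; the real submodule is a set of hardware state machines as described in [4]:

```python
class PathTracker:
    """Toy model of the SDM path-tracking submodule: one state machine per
    check, all updated on emitEdge, saved/restored around function calls via
    the per-machine stacks that make up the StateStack."""

    def __init__(self, transitions, n_checks):
        self.transitions = transitions                    # {(state, (src, dest)): next_state}
        self.state = [0] * n_checks                       # one state machine per check
        self.state_stack = [[] for _ in range(n_checks)]  # StateStack: one stack per machine

    def emit_edge(self, src, dest):
        """Update every state machine according to the taken branch edge."""
        for i, s in enumerate(self.state):
            self.state[i] = self.transitions.get((s, (src, dest)), s)

    def enter_func(self):
        """Push the current states and restart tracking inside the callee."""
        for i, stack in enumerate(self.state_stack):
            stack.append(self.state[i])
            self.state[i] = 0

    def leave_func(self):
        """Pop the saved states, restoring the machines to their previous states."""
        for i, stack in enumerate(self.state_stack):
            self.state[i] = stack.pop()

    def check(self, num):
        """Tell the checking submodule which path-specific expression to use."""
        return self.state[num]
```

Because each stack in the StateStack belongs to exactly one state machine, all of them can be pushed and popped in parallel in hardware, which is the access-overhead benefit noted above.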
3 Optimization of Error Detection Implementation

The error detection approaches elaborated in the previous section provide error detection for applications at a certain time cost, under the limitation of the available hardware resources. For real-time systems, however, there is an additional timing requirement: the execution of an application, together with its error detection, must be finished before the deadline. In consideration of this point, error detection has to be accelerated, i.e. optimized, in order to reduce the overall execution time. How can an efficient implementation of error detection be achieved? The general idea is to determine an appropriate error detection implementation for each process of the application according to various factors. In this section, all relevant aspects of the optimization are explained. First, the general optimization framework is illustrated. Next, the system model is explained. Finally, two optimization algorithms are given to show how the error detection implementations can be optimized.
3.1 Optimization Framework
Figure 2: Framework Overview [3]
Figure 2 shows an overview of the general framework. The component emphasized in bold is the optimization framework presented in this section. The function of each component, including the optimization framework, is explained below. The goal is to minimize the worst-case schedule length (WCSL) of the application under hardware constraints.
• C code: represents the initial application.
Error Detection Technique and its Optimization for Real-Time Embedded Systems
• Process graphs: are obtained from the initial application and specify the precedence relationships among all processes.
• Error detection instrumentation framework: processes the initial application code by embedding error detectors into the code, and estimates the time overheads and hardware costs using the instrumented code.
• Optimization framework: takes as its input the process graphs, the overheads computed by the error detection instrumentation framework, the mapping of processes to computation nodes and the system hardware architecture. As its output, the optimization framework produces an error detection implementation that is close to the optimal one.
• Fault-tolerant schedule synthesis tool: generates the worst-case schedule length (WCSL) as the cost function, according to the optimization result. More details about this tool will be explained in Section 3.2.
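As an illustration of the data flowing between these components, consider the following sketch; all class and field names are assumptions made here for illustration, not the interface of the actual tool chain:

```python
# Illustrative data model of the optimization framework's inputs and outputs.
# Every name in this sketch is an assumption, not the real tool interface.
from dataclasses import dataclass

@dataclass
class OptimizationInput:
    process_graph: dict   # Pj -> list of successor processes
    overheads: dict       # Pj -> {EDI name: (time overhead, HW cost)}
    mapping: dict         # Pj -> computation node
    architecture: list    # computation nodes (each with an FPGA)

@dataclass
class OptimizationResult:
    edi_assignment: dict  # Pj -> chosen EDI
    wcsl: int             # cost value from the fault-tolerant schedule synthesis

inp = OptimizationInput(
    process_graph={"P1": ["P2"], "P2": []},
    overheads={"P1": {"SW-only": (180, 0)}, "P2": {"SW-only": (90, 0)}},
    mapping={"P1": "N1", "P2": "N1"},
    architecture=["N1"],
)
res = OptimizationResult(edi_assignment={"P1": "SW-only", "P2": "SW-only"}, wcsl=0)
print(len(inp.process_graph))  # -> 2
```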
3.2 Synthesis of Fault-Tolerant Schedules

In [2] an approach to the generation of fault-tolerant schedules is proposed. The input of the algorithm consists of the process graph obtained from the application, the worst-case execution times (WCET) of the processes, the worst-case transmission times (WCTT) of the messages, the error detection and recovery overheads for each process, the architecture on which the application is mapped, and the maximum number k of transient faults that can affect the system during one period. To tolerate these faults, re-execution is used: once a fault is detected, the initial state of the affected process is restored and the process is re-executed. The output of the algorithm is a set of schedule tables that capture the alternative execution scenarios corresponding to possible fault occurrences. Among all fault scenarios there is one that is worst in terms of schedule length; its schedule length is called the worst-case schedule length (WCSL), which must meet the deadline of the application.
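To make the re-execution model concrete, a minimal sketch follows. The function name and the exact composition of the overheads are assumptions for illustration; the real schedule synthesis of [2] distributes the k faults over all processes of a schedule, not just one.

```python
def worst_case_length(wcet: int, detection_overhead: int,
                      recovery_overhead: int, k: int) -> int:
    """Worst-case length of one process if all k transient faults hit it:
    every execution pays the error detection overhead, and each of the k
    detected faults triggers a state restoration plus one re-execution."""
    one_run = wcet + detection_overhead
    return (k + 1) * one_run + k * recovery_overhead

# Example: WCET = 60, detection adds 20, recovery costs 10, at most k = 2 faults.
print(worst_case_length(60, 20, 10, 2))  # -> 3 * 80 + 2 * 10 = 260
```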
3.3 System Model
Taking from [3]: "a set of real-time applications Ai is considered, modeled as acyclic directed graphs Gi(Vi, Ei) and executed with period Ti. The graphs Gi are merged into a single graph G(V, E), having the period T equal with the least common multiple of all Ti. This graph corresponds to a virtual application A. Each vertex Pj ∈ V represents a process, and each edge ejk ∈ E, from Pj to Pk, indicates that the output of Pj is an input for Pk. Processes are non-preemptable and all data dependencies have to be satisfied before a process can start executing. A global deadline D is considered, representing the time interval during which the application A has to finish."
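The merging of the periodic graphs can be sketched as follows; the dictionary encoding of the graphs and the example periods are illustrative assumptions:

```python
from math import lcm  # Python 3.9+

# Two toy applications as acyclic directed graphs: process -> list of successors.
G1 = {"P1": ["P2"], "P2": []}   # executed with period T1
G2 = {"P3": ["P4"], "P4": []}   # executed with period T2
T1, T2 = 20, 30

# Merged virtual application A: one graph G, period T = lcm of all periods.
G = {**G1, **G2}
T = lcm(T1, T2)
print(T)  # -> 60
```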
Figure 3 gives an intuitive illustration of the system model. P1 to P4 are four processes in an application, and m1 and m2 are two messages sent from one process to another. Figure 3c and e show the distributed architecture on which the application runs. It is composed of a set of computation nodes connected to a bus. In Figure 3a, b and d, the
Wei Cao
processes are mapped to these nodes, and the mapping is illustrated with shading. Each node consists of a central processing unit, a communication controller, a memory subsystem, and also includes a reconfigurable device (FPGA). For all messages sent over the bus (between processes mapped on different computation nodes), their worst-case transmission time (WCTT) is given. Such a transmission is modeled as a communication process inserted on the edge connecting the sender and the receiver process.
Three error detection implementations (see Section 2.2) are considered for each process in the application, and any of them may be selected and applied to any process.
Table 1: WCET and overheads [3]

Proc.  WCET | SW-only: WCETi hi ρi | Mixed HW/SW: WCETi hi ρi | HW-only: WCETi hi ρi
P1     60   |   240    0   0       |   100      15   20       |   80     40   45
P2     50   |   140    0   0       |    80      15   20       |   60     40   45
P3     40   |   150    0   0       |    60      10   15       |   50     30   35
P4     30   |   100    0   0       |    60      15   20       |   40     40   45

Figure 3: System Model [3]
3.4 EDI Optimization
Based on the different characteristics of FPGAs with static reconfiguration and FPGAs with PDR capabilities, two alternative optimization solutions, each based on a Tabu Search heuristic [7], will be proposed. Before the optimization algorithms are described in Section 3.4.4 and Section 3.4.5, some concepts used inside the algorithms are introduced in Section 3.4.1, Section 3.4.2 and Section 3.4.3.

3.4.1 Moves

Two types of moves are used in the algorithms: simple moves and swaps. A simple move applied to a process is defined as the transition from its current error detection implementation to an adjacent one in the ordered set H = {SW-only, mixed HW/SW, HW-only}, while a swap consists of two "opposite" simple moves concerning two processes mapped onto the same computation node. A swap is needed because of the hardware limitation on each computation node: the EDI of one process has to be moved from hardware towards software before the EDI of another process can be implemented more in hardware.
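These moves can be sketched in a few lines; the solution encoding (a process-to-EDI dictionary) and the function names are assumptions for illustration:

```python
H = ["SW-only", "mixed HW/SW", "HW-only"]  # ordered set of EDIs

def simple_moves(solution, process):
    """Transitions of one process to an EDI adjacent to its current one in H."""
    i = H.index(solution[process])
    for j in (i - 1, i + 1):
        if 0 <= j < len(H):
            neighbour = dict(solution)
            neighbour[process] = H[j]
            yield neighbour

def swaps(solution, p, q):
    """Two 'opposite' simple moves for processes p and q on the same node:
    one EDI moves towards software exactly when the other moves towards hardware."""
    for after_p in simple_moves(solution, p):
        for after_pq in simple_moves(after_p, q):
            delta_p = H.index(after_pq[p]) - H.index(solution[p])
            delta_q = H.index(after_pq[q]) - H.index(solution[q])
            if delta_p == -delta_q:
                yield after_pq

sol = {"P1": "SW-only", "P2": "HW-only"}
print([m["P1"] for m in simple_moves(sol, "P1")])  # -> ['mixed HW/SW']
```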
3.4.2 Selection of the Best Move

In the algorithms, the operation "select the best move" has to be performed, i.e. the best move has to be selected from among all possible moves. But considering the
efficiency of the algorithms, only moves that affect processes on the critical path of the worst-case schedule of the current solution are explored. When the best move needs to be selected, the processes on the critical path of the current solution are first identified; then the search for the best move proceeds according to the following criteria:
1. Simple moves into HW are explored first; if such a move is not possible, swap moves are tried.
2. If there exist moves that are not tabu, whether simple or swap, the move producing the best improvement is selected and the exploration of further moves stops.
3. If the WCSL gets closer to a minimum with the help of this move, the move is accepted. If no such simple or swap move exists, the search has to be diversified.
3.4.3 Diversification Strategy

The diversification strategy in the algorithms consists of a continuous diversification strategy and a restart strategy. The former uses an intermediate-term frequency memory to guarantee that a process which has not been involved in a move for a long time will eventually be selected. Complementary to the continuous diversification strategy, the restart strategy restarts the search process if there is no improvement of the best known solution for a certain number of iterations.
3.4.4 EDI with static configuration

Figure 4 shows the pseudocode of the optimization algorithm for EDI with static configuration. The EDI assignment optimization algorithm for FPGAs with static reconfiguration begins with a random initial solution, which is taken as the current best solution. Next, the WCSL of this solution is calculated for evaluation, and the tabu list is initialized as empty. After recording the WCSL, the algorithm selects the best move among the possible moves for the current solution. This best move is then put into the tabu list and applied to the current solution. Once the current solution has been updated, the new WCSL is recalculated, and based on a comparison of WCSLs it is decided whether the current best solution is updated. Following the diversification strategy, if no improvement occurs for a certain number of iterations, the search process is restarted. Finally, when the maximal allowed number of iterations has been reached, the algorithm returns the current best solution and stops.
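The loop of Figure 4 can be made concrete with a small runnable sketch. It reuses the values of Table 1 but deliberately simplifies the cost function: all four processes run serially on one node, the WCSL is just the sum of the chosen WCETi, the hardware constraint is a simple budget on the sum of the hi values, and the diversification step is omitted. It is therefore an illustration of the tabu loop, not the actual fault-tolerant schedule synthesis.

```python
H = ["SW-only", "mixed HW/SW", "HW-only"]

# process -> {EDI: (WCETi, hi)}, values taken from Table 1.
TABLE = {
    "P1": {"SW-only": (240, 0), "mixed HW/SW": (100, 15), "HW-only": (80, 40)},
    "P2": {"SW-only": (140, 0), "mixed HW/SW": (80, 15),  "HW-only": (60, 40)},
    "P3": {"SW-only": (150, 0), "mixed HW/SW": (60, 10),  "HW-only": (50, 30)},
    "P4": {"SW-only": (100, 0), "mixed HW/SW": (60, 15),  "HW-only": (40, 40)},
}
HW_BUDGET = 60  # assumed FPGA area available for error detection

def wcsl(sol):
    return sum(TABLE[p][sol[p]][0] for p in sol)   # serial schedule on one node

def hw_cost(sol):
    return sum(TABLE[p][sol[p]][1] for p in sol)

def neighbours(sol):
    """Simple moves to an adjacent EDI that respect the HW budget."""
    for p in sol:
        i = H.index(sol[p])
        for j in (i - 1, i + 1):
            if 0 <= j < len(H):
                n = dict(sol)
                n[p] = H[j]
                if hw_cost(n) <= HW_BUDGET:
                    yield (p, H[j]), n

def edi_optimization(max_iterations=100, tabu_tenure=4):
    current = {p: "SW-only" for p in TABLE}        # feasible initial solution
    best, best_wcsl = dict(current), wcsl(current)
    tabu = []                                      # recently applied moves
    for _ in range(max_iterations):
        candidates = [(m, n) for m, n in neighbours(current) if m not in tabu]
        if not candidates:
            break
        move, current = min(candidates, key=lambda mn: wcsl(mn[1]))
        tabu = (tabu + [move])[-tabu_tenure:]      # short-term tabu memory
        if wcsl(current) < best_wcsl:
            best, best_wcsl = dict(current), wcsl(current)
    return best, best_wcsl

best, length = edi_optimization()
print(length)  # -> 300
```

Under this toy cost model the search settles on the mixed HW/SW EDI for all four processes (total HW cost 55 ≤ 60); every HW-only upgrade is rejected because it would exceed the budget.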
3.4.5 EDI with PDR FPGAs

Since FPGAs now support partial dynamic reconfiguration (PDR), it is possible to overlap the execution of one process with the reconfiguration of another process's error detector module. Thus, the
EDI_Optimization(G, N, M, W, C, k)
  best_Sol = current_Sol = Random_Initial_Solution();
  best_WCSL = current_WCSL = WCSL(current_Sol);
  Tabu = Ø;
  while (iteration_count < max_iterations) {
    best_Move = Select_Best_Move(current_Sol, current_WCSL);
    Tabu = Tabu U {best_Move};
    current_Sol = Apply(best_Move, current_Sol);
    current_WCSL = WCSL(current_Sol); Update(best_Sol);
    if (no_improvement_count > diversification_count)
      Restart_Diversification();
  }
  return best_Sol;
end EDI_Optimization

Figure 4: Optimization Algorithm of EDI with Static Configuration [3]
WCSL of the application can be further improved. Because of the limited hardware resources, however, this is not always possible: in such cases the reconfiguration of the error detector module of a process has to wait until the execution of another process is done. In the optimization algorithm for EDI with PDR FPGAs, the scheduling of processes on the processors and the placement of the corresponding EDIs on the FPGAs are performed simultaneously. The fault-tolerant schedule synthesis tool discussed in Section 3.2 cannot be used directly for this, because the particular issues related to PDR are not taken into account by the priority function of this tool, which decides the order of process execution. So, under the PDR assumptions, the priority function has to be modified. The new priority function for the optimization algorithm is:

f(EST, WCET, area, PCP) = x × EST + y × WCET + z × area + w × PCP

In this priority function, the parameter EST (earliest execution start time of a process) gives information about the placement and reconfiguration of EDI modules on the FPGA, WCET and the EDI area characterize the EDI of each process, and PCP captures the particular characteristics of each application. The value of each coefficient (x, y, z, w) in the priority function lies between -1 and 1, with a step of 0.25. Because of these coefficients, a new type of move concerning the weights x, y, z and w can be added. Thus, under the assumptions of PDR FPGAs, the optimization algorithm for FPGAs with static configuration is extended as follows: in each iteration, different values for the weights are explored before different EDI assignments to processes. It is first checked whether changing the coefficient values yields a better priority function leading to a smaller WCSL. If not, different EDI assignments to processes are explored, exactly as in the previous optimization algorithm.
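The weight grid and the priority computation can be sketched as follows; the example parameter values for a process are invented for illustration:

```python
from itertools import product

STEPS = [i * 0.25 for i in range(-4, 5)]   # -1.0, -0.75, ..., 0.75, 1.0

def priority(weights, est, wcet, area, pcp):
    """f(EST, WCET, area, PCP) = x*EST + y*WCET + z*area + w*PCP"""
    x, y, z, w = weights
    return x * est + y * wcet + z * area + w * pcp

print(len(STEPS))                            # -> 9 values per coefficient
print(len(list(product(STEPS, repeat=4))))   # -> 9**4 = 6561 weight combinations
print(priority((1.0, 0.5, -0.25, 0.25), est=10, wcet=60, area=15, pcp=2))  # -> 36.75
```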
4 Experimental Results

In [3], experiments were performed on synthetic examples to show the results of applying the optimization algorithms. Process graphs were generated with 20, 40, 60, 80, 100 and 120 processes each, mapped on architectures consisting of 3, 4, 5, 6, 7 and 8 nodes, respectively. 15 graphs were generated for each application size, out of which 8 have a random structure and 7 have a tree-like structure. Worst-case execution times for processes were assigned randomly within the range of 10 to 250 time units.
To determine the time overheads and hardware costs for each EDI, two experiment classes were generated: the first one, testcase 1, was based on the estimation of overheads done by Pattabiraman et al. in [6] and by Lyle et al. in [4]. For the other one, testcase 2, the hardware was assumed to be slower; thus, to reach the same time overheads as in testcase 1, more hardware is required. Figure 5 shows the ranges used for randomly generating the overheads. Figure 5a shows the ranges for testcase 1. As
Figure 5: Ranges for random generation of EDI overheads [3]
shown, for the SW-only EDI, the time overhead ranges from a minimum of 80% to a maximum of 300% of the worst-case execution time of the corresponding process, while the HW cost is zero. For the mixed HW/SW EDI, the time overhead range is between 30% and 70%, and the HW cost range is between 5% and 15%. Finally, for the HW-only EDI, the time overhead range decreases to 5%-25%, while the range for the HW cost increases to 50%-100%. In Figure 5b, the time overhead ranges stay the same, but the HW cost ranges are pushed more to the right.
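As a sketch, overheads in the spirit of testcase 1 could be drawn like this; only the ranges come from the text, while the generator itself is an assumption:

```python
import random

# EDI -> ((time overhead range as a fraction of the WCET), (HW cost range))
RANGES_TESTCASE1 = {
    "SW-only":     ((0.80, 3.00), (0.00, 0.00)),
    "mixed HW/SW": ((0.30, 0.70), (0.05, 0.15)),
    "HW-only":     ((0.05, 0.25), (0.50, 1.00)),
}

def random_overheads(wcet, rng=random):
    """Draw a (time overhead, HW cost) pair per EDI, uniformly in its ranges."""
    result = {}
    for edi, ((t_lo, t_hi), (h_lo, h_hi)) in RANGES_TESTCASE1.items():
        result[edi] = (wcet * rng.uniform(t_lo, t_hi), rng.uniform(h_lo, h_hi))
    return result

random.seed(1)
overheads = random_overheads(wcet=100)
print(80 <= overheads["SW-only"][0] <= 300)  # -> True
```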
4.1 Results for static reconfiguration
Here the SW-only EDI serves as the baseline to show the results after the optimization algorithm for FPGAs with static configuration is applied. To show the effectiveness of the optimization algorithm, the results it generates (indicated by "heuristic" in Figure 6) were compared with the theoretical optimum generated by a Branch and Bound (BB) algorithm. The performance improvement (PI) was calculated as follows:

PI = (WCSLbaseline − WCSLstatic) / WCSLbaseline × 100%
execution time overheads and <strong>the</strong> HW cost overheads for <strong>the</strong> proc-<br />
We would also like to p<br />
esses in our syn<strong>the</strong>tic examples are distributed uniformly in <strong>the</strong><br />
we can reduce <strong>the</strong> WCS<br />
intervals depicted in Figure 12a (testcase1) and Figure 12b (test-<br />
ment >50%), for testcas<br />
case2).<br />
WCSL by half, we need<br />
We also varied <strong>the</strong> size <strong>of</strong> every FPGA 28<br />
available for placement <strong>of</strong><br />
to <strong>the</strong> assumptions we m<br />
error detection. We proceeded as follows: we sum up all <strong>the</strong> HW<br />
(see Figure 12), namely<br />
cost overheads corresponding to <strong>the</strong> HW-only implementation, for<br />
need more HW in order<br />
all processes <strong>of</strong> a certain application:<br />
case1. As we can see from<br />
HW<br />
only<br />
1<br />
HW<br />
cost<br />
x100%<br />
Average improvement<br />
70%<br />
60%<br />
50%<br />
40%<br />
30%<br />
20%<br />
10%<br />
0%<br />
5%:<br />
10%:<br />
15%:<br />
testcase<br />
20%:<br />
25%:<br />
30%
d<br />
SW<br />
testcase2<br />
HW<br />
only<br />
0.55 0.75<br />
1<br />
HW<br />
cost<br />
x100%<br />
The WCSL_static is the worst-case schedule length calculated by the optimization algorithm, while the WCSL_baseline is that of the SW-only baseline solution. Figure 6 shows the final results. From
Figure 6: Comparison with theoretical optimum [3]
a general view, for testcase 1, the biggest difference between the optimization algorithm and the optimum reached 1%, while for testcase 2, the biggest difference went up to 2.5%.

4.2 PDR Approach

Here the efficiency of implementing error detection on FPGAs with partial dynamic reconfiguration was tested; the experiment setup was the same as in the static case. The efficiency was evaluated through comparison with the results of the static approach. Similarly to the static approach, the performance improvement is described as:

PI_PDR = (WCSL_static - WCSL_PDR) / WCSL_static × 100%

The WCSL_PDR is the result generated by the optimization algorithm for the FPGA with PDR. Figure 7 shows the final results. Through the comparison with the static approach, one can observe that the schedule length is shortened by up to 36% for testcase 1 (with a HW fraction of 5%) and by up to 34% for testcase 2 (with a HW fraction of 25%).

5 Conclusion

For error detection implementation, the SW-only approach, in which both path tracking and variable checking are implemented in software, does not require hardware resources, but it leads to considerable performance overhead; the HW-only approach, in which both path tracking and variable checking are performed in hardware, reduces the performance overhead, but it may lead to costs sometimes exceeding the amount of available resources. Since
Figure 7: Improvement - PDR over Static Approach [3]

each application consists of a certain number of processes, EDI can be applied to each process. Through the optimization of the EDI for each process, the optimization of the
WCSL for the application can be achieved. Two optimization algorithms are introduced, one for EDI on FPGA with static configuration and the other for EDI on FPGA with PDR. For EDI on FPGA with static configuration, the optimization algorithm assigns different EDIs to processes to minimize the WCSL, while the optimization algorithm for EDI on FPGA with PDR explores different weight values of the priority function before the assignment of EDIs to processes. Experimental results have shown the improvement of the WCSL of the application after applying the corresponding algorithms and proved their effectiveness.
References

[1] D. Evans, J. Guttag, J. Horning, and Y.M. Tan. LCLint: A tool for using specifications to check code. In ACM SIGSOFT Software Engineering Notes, volume 19, pages 87–96. ACM, 1994.

[2] V. Izosimov, P. Pop, P. Eles, and Z. Peng. Synthesis of fault-tolerant schedules with transparency/performance trade-offs for distributed embedded systems. In Proceedings of the conference on Design, Automation and Test in Europe, pages 706–711. European Design and Automation Association, 2006.
[3] A. Lifa, P. Eles, Z. Peng, and V. Izosimov. Hardware/software optimization of error detection implementation for real-time embedded systems. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010 IEEE/ACM/IFIP International Conference on, pages 41–50. IEEE, 2010.

[4] G. Lyle, S. Chen, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer. An end-to-end approach for the automatic derivation of application-aware error detectors. In Dependable Systems & Networks, 2009. DSN'09. IEEE/IFIP International Conference on, pages 584–589. IEEE, 2009.

[5] K. Pattabiraman, Z. Kalbarczyk, and R.K. Iyer. Application-based metrics for strategic placement of detectors. In Dependable Computing, 2005. Proceedings. 11th Pacific Rim International Symposium on, pages 8 pp. IEEE, 2005.

[6] K. Pattabiraman, Z.T. Kalbarczyk, and R.K. Iyer. Automated derivation of application-aware error detectors using static analysis: The Trusted ILLIAC approach. Dependable and Secure Computing, IEEE Transactions on, 8(1):44–57, 2011.

[7] C.R. Reeves. Modern heuristic techniques for combinatorial problems. John Wiley & Sons, Inc., 1993.

[8] F. Tip. A survey of program slicing techniques. 1994.
CPU vs. GPU: Which One Will Come Out on Top?
Why There is no Simple Answer
Denis Dridger
University of Paderborn
dridger@mail.upb.de
January 12, 2012
Abstract

Today's applications need to process an enormous amount of data due to ever-growing user requirements. Since traditional single-core CPUs have reached their speed limits, vendors nowadays provide powerful multi-core architectures to cope with the computation load. Although these architectures provide significant speedups compared to single-core CPUs, another trend has emerged in the past few years: performing general purpose computations on graphics processing units (GPUs). The fast-paced evolution of GPUs makes ever more computing power available along with a reasonable programming model. Ever since, many publications have presented phenomenal speedups of up to several hundred fold over CPUs.

In this paper we take a critical look at those claims and clarify that such speedups should be interpreted carefully. In doing so we discuss the question whether achieving such speedups is realistic or just a myth. There are many parameters that should be considered when conducting speedup measurements in order to obtain a meaningful result. Unfortunately, many publications omit or conceal important details such as the time for data transfers between GPU and CPU or the optimizations performed on the CPU code. In fact we find that many reported speedups might easily decrease by a factor of 10 or more if such considerations were made.
1 Introduction

Today, applications require immense computing power to satisfy the ever-growing needs of the high-performance computing community. In recent years the computing industry recognized that traditional single-core architectures cannot meet these demands anymore, and began to move toward multi-core and many-core systems [3]. Given the fact that parallelism is the future of computing, hardware designers continuously focus on adding more processing cores. The recent trend is to perform high-performance computations also on graphics processing units (GPUs). GPUs have evolved into powerful graphics engines, which feature programmability as well as peak arithmetic performance and memory bandwidth that can compete with modern CPU architectures [1]. The number of available processing units in a GPU exceeds the number of available CPU cores by far. For example, NVIDIA's GTX 280 graphics card (which is not a high-end GPU anymore) possesses 240 processing units, while Intel's Core i7 CPU provides only 4 cores. In addition, GPU vendors also provide powerful programming models that enable the user to port many applications to the GPU and leverage its massive parallel computing power. The most notable programming model is NVIDIA's Compute Unified Device Architecture (CUDA) [7], which allows programming GPUs in a C-like language. After CUDA's appearance in 2007, many researchers grabbed the opportunity to accelerate diverse algorithms on GPUs and reported significant speedups as high as 100X and far beyond, compared to CPU based approaches.

However, Lee et al. [14] claim that achieving such speedups is a myth. Although this paper is very recent, it has already become immensely popular. Motivated by this publication, we take an objective look at it, as well as at many other papers that debate CPU vs. GPU performance. In doing so we try to find evidence that supports or contradicts this claim. Studying different publications that report on

• speedups that have been achieved on GPUs ([9, 12, 13, 18, 20, 22, 23, 24, 25, 27])
• optimization opportunities for CPU and GPU ([8, 17, 19])
• considerations when conducting performance comparisons between CPU and GPU ([2, 10, 14, 26])

we find that many papers in fact do not provide completely fair performance comparisons or conceal important details concerning them. The study shows that there is a number of parameters that influence the performance comparison results, which implies that reported results should be interpreted carefully. In many cases it is not very meaningful to say that the GPU is X times faster than the CPU because of the following parameters:

• used hardware (e.g. single-threaded CPU vs. high-end GPU)
• performed optimizations (e.g. non-optimized CPU code vs. optimized GPU code)
• consideration of data transfers between CPU and GPU
• used application (e.g. serial code vs. highly parallel code)
• intention of the author (e.g. CPU vendor vs. GPU vendor)

In this work we discuss the above mentioned influence parameters and try to answer the question whether achieving such great speedups is a myth or really possible. The answer is: it depends! Though it is not possible to provide a definite answer to this question, this work provides some interesting insights that may help to understand where tremendous speedups of more than 100X might come from.
The remainder of this paper is structured as follows. The next section introduces the new trend of performing computations on GPUs. It covers a brief overview of the CUDA programming model and several examples of applications for which great speedups have been achieved. Section 3 provides information on technical aspects of CPUs and GPUs and highlights the differences between the two platforms. Here the features of each platform are described on a level that is reasonable for understanding the differences between the platforms as well as their approaches to processing data. The next two sections form the core of the paper. Section 4 tries to clarify why comparing the performance of CPU and GPU is not an easy task. In particular, it explains why the results of such comparisons may vary from paper to paper by several orders of magnitude. In section 5 we impartially discuss the claim that achieving 100X GPU speedups is just a myth, as suggested by Lee et al. [14], with the help of our previous considerations. Finally, the work is concluded in section 6.
2 The New Trend: General Purpose Computing on GPUs

The GPU is no longer just a fixed-function processor designed to accelerate 3D applications. Over the past few years the GPU has evolved into a highly parallel and flexibly programmable processor featuring special purpose arithmetic units. With GPUs one gets much computing power for low cost. Today's GPUs can provide a peak performance of over 1 TFlop/s and a peak bandwidth of over 100 GiB/s [9]. Figure 1 shows the performance increase over the past few years. As the figure suggests, the theoretical performance nearly doubled each year, which attracted the interest of more and more application developers and researchers.

Another very important reason why today's GPUs are so attractive is their programmability. With the appearance of CUDA, programmers no longer need to deal with cumbersome graphics APIs (that were actually designed to handle polygons and pixels) when porting an application to the GPU. CUDA is probably the best known and most used programming model that is currently available. All studied publications concerning GPU performance or optimizations use CUDA; therefore we will also focus on CUDA and NVIDIA's GPU architecture in this work.

CUDA also refers to NVIDIA's hardware architecture, which is tightly coupled to the programming model [7]. The hardware architecture is introduced in the next section. In
Figure 1: GPU performance increase over the years. Figure is adapted from [5].

this section we want to take a brief look at CUDA, the programming model, and some application examples for which notable speedups have been achieved using CUDA.

2.1 The CUDA Programming Model

In the CUDA model a GPU is considered an accelerator that is capable of executing parallel code and special purpose code like mathematical arithmetic. The code that shall be accelerated on the GPU is referred to as a kernel. CUDA programs are basically C programs with extensions to leverage the GPU's parallelism and consist of two parts: the non-critical part that shall run on the CPU and the critical part, the kernel, that shall run on the GPU. When executing a kernel, the GPU runs many threads concurrently, each of which executes the same program on different data. This approach is known as SPMD (Single Program, Multiple Data). An illustration of the thread execution in the CUDA model is shown in Figure 2.

CUDA programs consist of mixed code for CPU and GPU. The CPU (host) code is an ordinary C program, whereas the GPU code is written as a C kernel, using additional keywords and structures. In addition, there are several restrictions on the kernel code: no recursion, no static variables and no variable number of function parameters. Both code fragments are compiled separately by the NVIDIA CUDA C compiler, as shown in Figure 3. The kernel execution on the GPU is launched by the host. The host code is also responsible for transferring data to and from the GPU's global memory, with the help of special API calls.
Figure 2: The CUDA model considers the CPU as host, which runs code with no/low parallelism. The GPU is treated as an accelerator, which executes parallel code by running thousands of threads at the same time. Figure is adapted from [12].

Figure 3: The CUDA compilation flow. Figure is adapted from [19].
2.2 Application Examples

Over the past few years researchers have ported different applications to the GPU, in particular using CUDA. The accelerated applications come from various areas including engineering, medicine, finance, cryptography and multimedia. In the majority of cases the applied algorithms solve problems that deal with searching, sorting, mathematical computations and image processing.

Next, several examples of accelerated algorithms are presented that were taken from recent publications. Although there were no special criteria for selecting the papers, most of the chosen publications report significant speedups compared to corresponding CPU implementations. At this point we do not want to consider performance comparison details such as the exact hardware used or the optimizations performed. We will take a closer look at these details in sections 4 and 5. In all cases, the algorithms were implemented on high-end (or almost high-end) NVIDIA GPUs that were available at that time. The
corresponding CPU implementations, in contrast, were run on high-end CPUs only in the best case. In addition, these implementations were optimized questionably or not optimized at all.

• The sparse matrix-vector product (SpMV) is of great importance in linear algebra and hence in engineering and scientific programs. There has been much work improving the performance of SpMV on various systems in the last years. Vazquez et al. [24] implemented SpMV on the GPU and achieved a speedup of 30X.
• The fast Fourier transform (FFT) is also a very important algorithm, which transforms signals from the time domain into the frequency domain. Naga et al. [9] achieved a speedup of 40X.
• The fast multipole method (FMM) is widely used for problems arising in diverse areas (molecular dynamics, astrophysics, acoustics, fluid mechanics, electromagnetics, scattered data interpolation etc.) because of its ability to compute dense matrix-vector products in linear time and memory with a fixed prescribed accuracy. Gumerov et al. [11] achieved a speedup of 60X.
• Database operations also have parallelization potential. Bakkum et al. [4] implemented a subset of the SQLite command processor on the GPU and achieved speedups between 20X and 70X.
• Password recovery algorithms provide excellent opportunities to exploit parallelism since passwords can be checked independently. Hu et al. [12] and Phong et al. [18] achieved speedups of over 50X and 170X, respectively.
• Image processing is another important application domain, which promises good speedup results due to the low data dependency. Zhiyi et al. [27] achieved speedups of up to 200X.
• The sum-product or "marginalize a product of functions" problem is a rather simple kernel, which is used in different real-life applications. Silberstein et al. [22] achieved a speedup of 270X.
3 Differences Between Today's CPUs and GPUs

In this section we highlight the differences between the two platforms and try to state some reasons why computing on GPUs may be a reasonable option.

3.1 The CPU

CPUs are designed to support a wide variety of applications, which can be single-threaded or multi-threaded. In order to improve the performance of single-threaded applications, the
CPU makes use of instruction-level parallelism, where several instructions can be issued at the same time. Multi-threaded applications may leverage additional cores along with SIMD (Single Instruction, Multiple Data) technology. Modern CPUs possess four to eight cores, run at frequencies above 3 GHz and provide other useful features such as branch prediction. Intel's Hyper-Threading technology allows a single physical processor to execute two heavyweight threads (processes) at the same time, dynamically sharing the processor resources [15]. An example of such a processor is Intel's Core i7 CPU, which is used by Lee et al. in [14] to show that CPUs can/might compete against GPUs.

However, providing all these architectural advances in order to support general purpose computing well results in rather complex chips, and thus large chip areas, which in turn limits the number of cores that can be placed onto the chip. Since the number of application pieces that can be processed in parallel is limited by the available parallel processing resources of the processor, GPUs become more interesting to researchers and application developers.
3.2 The GPU

The GPU provides many scalar processor cores, each of which is rather simple compared to a CPU core. Scalar processors are grouped into multiprocessors (also known as streaming multiprocessors) and can execute the same program in parallel using threads. CUDA threads are similar to ordinary operating system threads, with the difference that the overhead for creating and scheduling threads is extremely low and can be safely ignored [6]. The threads, in turn, are grouped into thread blocks that are scheduled by the GPU onto the multiprocessors. A modern GPU is capable of running thousands of threads at the same time, which helps to hide memory latencies. If a thread block issues a long-latency memory operation, the multiprocessor will quickly switch to another block while the memory request is satisfied by the memory controller. The GPU provides different memory types. Each processor core has a very small cache, and each multiprocessor has a shared memory, which can be accessed by all cores located on this multiprocessor. The device itself provides a large global memory, which can be accessed by all multiprocessors. Shared memory is an on-chip memory and can be accessed extremely fast, while accessing the global memory, which is off-chip, takes much longer. For example, a GeForce 8800 consumes only 4 clock cycles for fetching data from shared memory, while the same operation takes 400 to 600 clock cycles for the global memory [27]. However, the shared memory is, at ca. 16 KB, quite small, while the global memory provides several hundreds of megabytes.
Figure 4 illustrates the organization of multiprocessors, processor cores and memory on a GTX 280 GPU. Although this GPU was already introduced in 2008, and is surely not a high-end graphics device anymore, it was used in most of the recent publications that were studied in this work.

However, having many cores and being able to run many threads in parallel does not by itself make the GPU that fast. Data throughput is the feature that can be considered the most important one. Today's GPUs provide a bandwidth of over 100 GiB/s to keep the
Figure 4: GeForce GTX 280 GPU with 240 scalar processor cores, organized in 30 multiprocessors. Figure is adapted from [20].
processors busy and thus exploit as much computational power as possible. Gather/Scatter is another profitable feature of the GPU, which allows reading and writing data from and to non-contiguous memory addresses in the global memory. This is important for treating applications with irregular memory accesses in SIMD fashion [1, 14, 23]. Last but not least, each multiprocessor has several built-in function units to support fast execution of texture sampling and frequently used arithmetic operations like square root, sine and cosine. These units also contribute to a kernel's speedup if the kernel makes use of the supported functions. Ryoo et al. [19] found that these special units contribute about 30% to the speedups of the evaluated trigonometry benchmarks. Lee et al. [14] suggest that the texture sampling unit of the GTX 280 GPU greatly contributed to the speedup of a collision detection algorithm (namely GJK).
In addition, the performance of graphics hardware increases rapidly, and notably faster than that of CPUs. But how can this be? Both chips consist of transistors, after all. The reason is that many transistors built into CPUs do not contribute to the actual computational work; instead, they are used for non-computational tasks like branch prediction and caching, while the highly parallel nature of GPUs allows them to use the additional transistors for computation [16]. A few years ago, GPU vendors introduced support for double-precision floating-point arithmetic for the first time. This innovation removed one of the major obstacles to the adoption of the GPU in many scientific computing applications [1].
3.3 Summary in Table Form
The table below summarizes the features of CPU and GPU and highlights the differences between the two platforms. Here, we ignore characteristics such as performance growth rate, cost and power consumption, because they do not directly contribute to the performance achievable on the device.
Table 1: Comparison of CPU and GPU features that are relevant for computing performance. In order to present the differences in an easily comprehensible way, we rate each feature with plus (+) symbols, where + means that the respective feature is poorly supported and +++++ means that it is very well supported. The table is based on information obtained from [1, 8, 16, 14].
                           CPU      GPU      Comment
Application domain         +++++    ++       GPU requires highly parallel applications
Number of cores            +        +++++
Processor frequency        +++++    ++
Peak throughput            +++      +++++
Caches/shared memory       +++++    +
Gather/Scatter             +        +++++    Usually no hardware support on CPU
Special function units     +        +++      Usually none or very few in CPUs, a few in GPUs
Chip area contributing     ++       +++++    CPU "wastes" many transistors for caching
to computation                               and control logic
4 Considerations When Conducting Performance Comparisons
The authors of [2], [10], [14] and [26] highlight important details regarding CPU/GPU speedup comparisons. They all agree that comparisons found in publications are often taken out of context. In this section we introduce four parameters that influence performance comparisons and should therefore be considered when conducting them.
4.1 The Application
It is obvious that some applications are perfectly suited to run on the CPU, whereas others fit perfectly on the GPU. In the extreme case, we have a single-threaded application, which would leverage the corresponding CPU features and run very well. Running the same application on the GPU would even result in a slowdown, because only a single processor would be active, which is comparatively slow. In addition, the performance would suffer from the overhead of migrating the data to and from the GPU's memory. On the other hand, running perfectly parallelizable code that is compute bound and largely independent of other operations would provide tremendous speedups on the GPU, while the CPU implementation would have to make do with the few parallel units it has. Applications that can work on small input data sets, or can generate input data directly on the GPU (i.e., without the need to fetch it from the CPU), may also perform well on GPUs.
4.2 The Hardware
When comparing the performance of CPU and GPU, the achieved speedups strongly depend on which CPU and GPU are used. For example, using the next better GPU model instead of the chosen one can double the theoretically deliverable performance. That is because GPUs evolve rapidly, and thus a newer GPU usually features more processing cores and higher memory bandwidth. Usually there is also a performance gain from choosing a better CPU model, though the expected gain is not as large as in the GPU case, since the number of additional cores is very limited. It seems obvious that speedups measured on a GPU would (probably) decrease if the CPU used featured more cores. Thus, speedups may decrease by half if a dual-core processor is used instead of a single-core processor, and so forth.
But how can one provide meaningful measurement results if there is such a wide variety of CPUs and GPUs available on the market? It is probably best to take the best available hardware for both platforms and, as Lee et al. [14] suggest, to compare GPUs against thread- and SIMD-parallelized CPU code. The result would then state the performance gain achievable on state-of-the-art hardware.
For example, comparing the execution time of a kernel using a high-end GPU on the one hand and an obsolete single-threaded CPU on the other does yield high speedup numbers, but not very useful results. Authors whose primary aim is not to report GPU speedups, but to inform about other concerns such as optimization techniques, often choose better comparable hardware to produce objective results. Such publications include [13], [14], [17] and [23]. In [8], the achieved GPU results are even compared to several CPU platforms, which is very useful since notable performance gaps to other CPUs become directly visible. Correspondingly, the measured speedups in these publications are all less than 10X. In contrast, it is not very surprising that authors who try to deliver GPU speedups that are as high as possible (in particular, higher than any reported speedups for similar algorithms) tend to choose weaker CPUs. If we take a look at the papers that report great speedups (as mentioned in section 2.2), we find evidence of this. In [4], [11], [12], [18], [22] and [27], a sequential CPU program is used as the reference, while state-of-the-art GPUs are used on the other side. In [24], a dual-core CPU is used, although quad-core processors had already existed for several years. Only Govindaraju et al. [9] implemented their algorithm on a high-end quad-core CPU.
4.3 Performing the Optimizations
A program's code may be optimized to better leverage the given hardware resources. Differences in execution time between an optimized and an unoptimized program can be significant. For example, in [14], Lee et al. report that the speedup of an algorithm, which had been reported to be 114X over CPUs, decreased to only 5X after their careful optimizations. Ryoo et al. [19] researched tree search algorithms on CPU and GPU. They confirm that the speedup gap shrinks significantly when optimized CPU code is used: the gap was reduced from 8X to 1.7X for large trees. For smaller trees, the
CPU implementation was even two times faster than the GPU implementation.
In most of the studied publications that achieve great speedups on GPUs, the description of the CPU optimizations is thin, whereas the optimizations of the GPU version are explained in detail. Often, authors do not consider CPU optimizations at all, or merely mention that they use "optimized" CPU code.
We now take a look at the available tuning opportunities for both platforms to gain some insight into how performance can be increased. One basic optimization approach is to reduce or hide memory latencies: for this purpose, CPU designs use large caches, whereas GPU designs seek to keep thousands of threads in flight. The efficient utilization of the computing resources also depends on how well instruction-level, thread-level and data-level parallelism can be extracted.
4.3.1 CPU Optimizations
• Scatter/Gather can be realized by hand-coding the instruction sequence, which significantly improves SIMD performance. For example, Smelyanskiy et al. [23] managed to reduce the number of instructions needed to fetch data from 4 non-contiguous memory locations from 20 (as generated by the compiler) to 13.
• Cache blocking is the standard technique for reducing cache misses on CPUs. Cache blocking restructures loops that iterate frequently over large data arrays by dividing them into smaller blocks. Each data element is then reused within a block that fits into the data cache before the next block is processed. Lee et al. made intensive use of cache blocking in [14] and observed that the performance of the "Sort" and "Search" benchmarks improved by 3-5X when applying the technique.
• Data layout is critical for processing data in parallel, especially if no hardware support for scatter/gather is available. Reordering data requires a good understanding of the underlying memory structure. For example, in [14], Lee et al. improve the performance of the Lattice Boltzmann method (also known as LBM) by 1.5X by reordering array data structures.
4.3.2 GPU Optimizations
Accessing the GPU's off-chip memory is a major bottleneck in GPU computing. Hence, reducing global memory latency is the main concern when optimizing GPU code [19, 27]. The basic techniques for hiding memory latency are listed below.
• Using as many threads as possible is a very common approach to hiding memory latency. It improves the utilization of the processors, because a great number of threads can run while many other threads are waiting for their read or write requests to the global memory to be satisfied. Switching between active and inactive threads is very fast on GPUs and hence does not cause notable
overhead, as already mentioned in the previous section. A GPU code developer should therefore try to create as many threads as possible. To fully utilize today's GPUs, it is necessary to create 5,000 to 10,000 threads [20].
• Reusing data that is already located in the shared memory avoids expensive accesses to the global memory. The thread that loads a datum into the shared memory may perform a synchronization operation, so that other threads of the same block can access this data too, instead of fetching it from global memory.
• Loading data in blocks helps to reduce global memory latency for applications that can take advantage of contiguity in main memory. An example of such an application is matrix multiplication: Ryoo et al. [19] load parts of the matrix as n×n blocks, which are then processed by n×n threads in parallel. For example, the results for two 16×16 input blocks are computed by 256 threads.
4.4 Data Transfers Between CPU and GPU
The time needed for memory transfers between CPU and GPU is critical to the overall performance of an application [1, 8, 10, 13, 21]. Since CPU and GPU cannot exchange data while a kernel is running, executing a kernel on the GPU usually involves the following steps:
1. CPU: copy input data from CPU memory to GPU memory
2. CPU: launch n instances of the kernel
3. GPU: process n pieces of data in parallel
4. CPU: copy output data from GPU memory to CPU memory
Gregg et al. [10] observed that many published performance comparisons do not state exactly where the data resides before kernel execution and what happens to it afterwards. They argue that taking memory transfer times into account may reduce the achieved speedups significantly. Indeed, they show that the execution time of the benchmarked kernels increases by a factor of 2 to 50 when transfer times are included. Furthermore, they point out that measuring only the raw kernel execution time is meaningless if the results produced by the GPU have to be used by the CPU afterwards: the kernel may be fast, but the execution time of the whole application would also include the time for copying the results from GPU to CPU. Ignoring transfer times in publications also makes it hard to judge whether executing on the GPU is worthwhile at all. Figure 5 and Figure 6 show examples that demonstrate the impact of data transfer times.
Surprisingly, many of the studied publications, including [9], [14], [27] and [20], ignore the time for memory transfers completely in their performance comparisons. No (or unclear) information on memory transfers is provided by [12], [18] and [24]. Bakkum et al. [4], who achieved 20X-70X speedups by porting database operations to the GPU, do include memory transfers in their comparisons. They state that excluding memory transfers would lead to speedups close to 200X, which would not be a fair comparison, though. The authors of [8], [23] and [25] also consider memory transfer times and achieve correspondingly low speedups.
Figure 5: Execution times of the SpMV kernel for growing input matrices. The time for moving the matrix to the GPU's memory heavily dominates the overall execution time of the kernel. Figure is adapted from [10].
Figure 6: Measured performance for stencil computations. Blue bars represent GPU implementations; the other bars represent CPU-based implementations. Taking the time for data transfers to/from the CPU into account degrades the GPU performance dramatically. Figure is adapted from [8].
5 Discussion: is the "100X GPU Speedup" Just a Myth?
5.1 Motivation
In recent years we have seen many claims concerning program speedups on GPUs. Put roughly, these claims often sound like this: "You can compute a matrix 100 times faster using a graphics card instead of a CPU" or "Password cracking on a graphics card is 200 times faster than on a CPU". But is this true? Can one state just like that that GPUs are so much better? Lee et al. [14] say no. Moreover, they argue that achieving such speedups is generally a myth: several parameters need to be considered to provide fair performance comparisons, and the reported speedups would decrease significantly if they were evaluated adequately. In their work, they re-evaluated various claims of GPU speedups of about 100X and ended up with much lower numbers. To this end, they implemented 14 algorithms on CPU and GPU respectively, and managed to bring the originally reported GPU speedups for these algorithms down to an average of 2.5X. The trick was to use a state-of-the-art Intel CPU along with several code optimization techniques.
Motivated by Lee et al., we investigated several publications in order to figure out which parameters these are and how they might influence the speedups. In fact, we found evidence that many performance comparisons seem to be taken out of context. Especially noticeable is the fact that authors who report huge speedups tend to conceal important details of their performance comparisons, or compare their GPU implementations to poorly optimized or outdated CPUs. Some examples of such publications were already mentioned in sections 2.2 and 4.2. Moreover, as shown in section 4.4, almost all publications (especially, again, those from section 2.2) ignore the time needed to transfer the data to and from the GPU. Since considering the transfers is essential in real-life applications, the reported speedups would decrease even further, because moving data is a very costly operation. Summing up, it is likely that these speedups would decrease significantly if just these two parameters were considered. For example, if we assume that the program is run in parallel on a quad-core CPU instead of on a single-threaded CPU, and that memory transfers account for "only" 2X of the speedup, then a 100X speedup would (theoretically) decrease to 12.5X. If we then apply elaborate optimizations to the CPU code, we might end up with a GPU speedup of less than 10X, which would be close to the results achieved by Lee et al.
5.2 Intention of the Author
So far we can say that reported speedups should be interpreted with care in order to draw meaningful conclusions. How and whether the influence parameters elaborated above play a role during performance comparisons depends on the author. Anderson et al. [2] point out that there are two distinct perspectives from which to make comparisons: that of application developers and that of computer architecture researchers. Application developers focus on demonstrating new application capabilities, designing algorithms for a particular
domain under a set of implementation constraints. Hence, when application developers report a 100X speedup using a GPU, the speedup numbers should not be misinterpreted as architectural comparisons claiming that GPUs are 100X faster than CPUs.
Architecture researchers, on the other hand, do not focus on a specific application domain but design architectures that perform well for a variety of application domains. To evaluate the designed architectures, researchers often use benchmark suites rather than elaborate data structures and algorithms that solve a concrete problem. Benchmark suites are designed to evaluate architectural features, not to produce great speedups. Anderson et al. also ask that every future comparison provide enough reference information to allow the reported speedups to be reproduced.
As mentioned before, Lee et al. [14] state that published GPU speedup numbers are generally exaggerated, and that CPUs can keep up with GPUs in many cases. However, the fact that Lee et al. are members of the Intel Corporation, which does not want to lose market share in general-purpose computing, may hint at their intention: namely, to push down the speedup numbers that were achieved using GPUs. Indeed, if we consult our influence parameters, we discover that Lee et al. used an outdated GPU for their comparisons, while next-generation GPUs, which could provide as much as twice the performance, were already available. In addition, Lee et al. do not detail the implementations used for comparison, which again makes it hard to comprehend or reproduce the results.
5.3 The Answer
The answer to our question of whether GPUs can achieve 100X speedups over CPUs is: it depends. The claim that a GPU implementation is 100X faster than a legacy sequential implementation is valid, and may be of great interest to application developers using the legacy implementation [2]. However, one can push this speedup down almost arbitrarily by adjusting the influence parameters introduced above. Even if the parallelism of a CPU implementation is limited by the number of available cores, one can still argue that adding further CPU sockets will match the GPU in performance, as shown by Vuduc et al. in [26].
Nevertheless, we can agree that GPUs have the potential to significantly accelerate parallel algorithms. Even though many reported speedups are exaggerated, today's, and especially future, GPUs are capable of providing notable speedups for certain, well-optimized applications.
6 Conclusions
In this work we have shown that reported speedups of GPU-accelerated algorithms often appear to be exaggerated. To this end, we first looked at the basic concepts of general-purpose computing on GPUs, presenting the GPU architecture, its programming model and application examples. Next, we discussed several parameters that influence performance comparisons, based on a study of several publications that deal with algorithm acceleration on GPUs. We have seen that the parameters (1) chosen application, (2) chosen hardware, (3) performed code optimizations and (4) consideration of memory transfers have a strong impact on the resulting speedup. Many authors, however, do not provide fair performance comparisons, tuning these parameters so that their GPU implementations outperform the corresponding CPU implementations by far. How the parameters are tuned is mainly driven by the author's intention, which can lead to speedups of 100X (and far beyond) over the CPU implementation. In turn, conducting absolutely "fair" performance comparisons often shows that GPU implementations provide only reasonable speedups, or even do not outperform the corresponding CPU implementations at all.
References
[1] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. "GPU Computing". Proceedings of the IEEE, 96(5):879-899, 2008.
[2] Michael Anderson, Bryan Catanzaro, Jike Chong, Ekaterina Gonina, Kurt Keutzer, Chao-Yue Lai, Mark Murphy, David Sheffield, Bor-Yiing Su, and Narayanan Sundaram. "Considerations When Evaluating Microprocessor Platforms". In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, HotPar '11, pages 1-1, Berkeley, CA, USA, 2011. USENIX Association.
[3] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. "The Landscape of Parallel Computing Research: A View from Berkeley". Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[4] Peter Bakkum and Kevin Skadron. "Accelerating SQL Database Operations on a GPU With CUDA". In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 94-103, New York, NY, USA, 2010. ACM.
[5] NVIDIA Corporation. "Compute Unified Device Architecture Programming Guide Version 2.0". http://www.nvidia.com/object/cudadevelop.htm, 2008.
[6] NVIDIA Corporation. "NVIDIA CUDA C Programming Guide". 2010.
[7] NVIDIA Corporation. "NVIDIA CUDA Zone". http://www.nvidia.com/object/cuda_home.html, 2011.
[8] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. "Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 4:1-4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[9] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. "High Performance Discrete Fourier Transforms on Graphics Processors". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 2:1-2:12, Piscataway, NJ, USA, 2008. IEEE Press.
[10] Chris Gregg and Kim Hazelwood. "Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer". In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '11, pages 134-144, Washington, DC, USA, 2011. IEEE Computer Society.
[11] Nail A. Gumerov and Ramani Duraiswami. "Fast Multipole Methods on Graphics Processors". J. Comput. Phys., 227:8290-8313, September 2008.
[12] Guang Hu, Jianhua Ma, and Benxiong Huang. "Password Recovery for RAR Files Using CUDA". In Proceedings of the 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, DASC '09, pages 486-490, Washington, DC, USA, 2009. IEEE Computer Society.
[13] Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs". In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 339-350, New York, NY, USA, 2010. ACM.
[14] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. "Debunking the 100X GPU vs. CPU Myth: an Evaluation of Throughput Computing on CPU and GPU". In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 451-460, New York, NY, USA, 2010. ACM.
[15] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, and Michael Upton. "Hyper-Threading Technology Architecture and Microarchitecture". Intel Technology Journal, 6(1):4-16, 2002.
[16] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, and Timothy J. Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, 26(1):80-113, 2007.
Denis Dridger<br />
[17] S. J. Pennycook, S. D. Hammond, S. A. Jarvis, and G. R. Mudalige. "Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark". SIGMETRICS Perform. Eval. Rev., 38:23–29, March 2011.

[18] Pham Hong Phong, Phan Duc Dung, Duong Nhat Tan, Nguyen Huu Duc, and Nguyen Thanh Thuy. "Password Recovery for Encrypted ZIP Archives Using GPUs". In Proceedings of the 2010 Symposium on Information and Communication Technology, SoICT ’10, pages 28–33, New York, NY, USA, 2010. ACM.

[19] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA". In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pages 73–82, New York, NY, USA, 2008. ACM.

[20] Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–10, Washington, DC, USA, 2009. IEEE Computer Society.

[21] Dana Schaa and David Kaeli. "Exploring the Multiple-GPU Design Space". In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–12, Washington, DC, USA, 2009. IEEE Computer Society.

[22] Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, and John D. Owens. "Efficient Computation of Sum-products on GPUs Through Software-managed Cache". In Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, pages 309–318, New York, NY, USA, 2008. ACM.

[23] Mikhail Smelyanskiy, David Holmes, Jatin Chhugani, Alan Larson, Douglas M. Carmean, Dennis Hanson, Pradeep Dubey, Kurt Augustine, Daehyun Kim, Alan Kyker, Victor W. Lee, Anthony D. Nguyen, Larry Seiler, and Richard Robb. "Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures". IEEE Transactions on Visualization and Computer Graphics, 15:1563–1570, November 2009.

[24] F. Vazquez, G. Ortega, J. J. Fernandez, and E. M. Garzon. "Improving the Performance of the Sparse Matrix Vector Product with GPUs". In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, CIT ’10, pages 1146–1151, Washington, DC, USA, 2010. IEEE Computer Society.

[25] Vasily Volkov and James W. Demmel. "Benchmarking GPUs to Tune Dense Linear Algebra". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 31:1–31:11, Piscataway, NJ, USA, 2008. IEEE Press.
[26] Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. "On the Limits of GPU Acceleration". In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, HotPar ’10, pages 13–13, Berkeley, CA, USA, 2010. USENIX Association.

[27] Zhiyi Yang, Yating Zhu, and Yong Pu. "Parallel Image Processing Based on CUDA". In International Conference on Computer Science and Software Engineering, 3:198–201, 2008.
Will Dark Silicon Limit Multicore Scaling?

Christoph Kleineweber
University of Paderborn
chkl@mail.uni-paderborn.de

January 12, 2012
Abstract

The performance of processors has grown exponentially over decades, but it is doubtful whether this scaling will hold for upcoming multicore processors. To answer this question, this work reflects on a study published by Esmaeilzadeh et al. [7], which presents an analytical model for making scaling predictions based on empirical data of current processor technologies. One of the most significant results is that dark silicon might become a relevant problem. Dark silicon is the fraction of the die area that remains unused because of power or application parallelism limits. We come to the conclusion that the level of parallelism is the most relevant cause of dark silicon.
1 Introduction

The exascale challenge is a frequently discussed topic in computer engineering. During the last decades, a continuous performance growth of CPUs was sustained. While energy efficiency improved with each new technology generation, the total power consumption of a CPU has historically grown along with its performance.
To avoid an exorbitant growth of power consumption, multicore CPUs and GPUs were established as an alternative to further increases of the single-core frequency. This strategy, however, requires applications with a certain level of parallelism to achieve performance improvements. In addition, memory and communication bandwidth remain an open challenge. To answer the question of whether current technology can meet upcoming performance needs with acceptable energy and chip area demands, Esmaeilzadeh et al. [7] carried out a detailed analysis of different models and empirical measurements of currently available devices and used this knowledge to estimate the scalability of upcoming technologies. An interesting aspect in this area is the fraction of dark silicon in upcoming processor generations. Dark silicon is the part of a die that remains unused, e.g. because of missing parallelism in an application or because of power constraints. In the worst case, dark silicon may limit the possible performance improvements of upcoming chip generations, even if the growth of chip complexity continues as in the past. This paper reflects the work of Esmaeilzadeh et al. and compares the results to alternative models.
1.1 Overview

The remainder of this paper is structured as follows: the rest of this section introduces basic models related to scaling compute performance and explains the different types of considered multicore topologies. Section 2 presents an empirical study on current processor technologies and makes predictions on upcoming technologies and the resulting performance; it consists of a device model, a core model, and a multicore model. Section 3 derives the scaling limitations and presents the sources of dark silicon. Section 4 summarizes related work. The last section concludes and discusses the feasibility of the presented work.
1.2 Basic Models

In the past, different performance and scaling models have been proposed. These models are necessary to predict the upcoming processor technology and performance. This section presents Moore’s Law, Pollack’s Rule, and Amdahl’s Law. In the remainder of this paper, we will discuss whether these models are sufficient to make detailed scaling predictions and, in particular, to predict the fraction of dark silicon.
1.2.1 Moore’s Law

Gordon E. Moore, one of the founders of Intel, observed in 1965 that the complexity of integrated circuits doubles every 18 months [11]. Complexity means in this context the number of transistors per die area. Unexpectedly, this rule has held for decades, and it was thereby the basis for the observed growth of compute performance. An interesting question is what effect further increases of processor complexity will have on performance, even if Moore’s Law continues to hold.
1.2.2 Pollack’s Rule<br />
One model to answer <strong>the</strong> question <strong>of</strong> <strong>the</strong> effect <strong>of</strong> an increased processor complexity is<br />
Pollack’s Rule [4]. Pollack’s Rule proposes that <strong>the</strong> increase <strong>of</strong> <strong>the</strong> performance <strong>of</strong> a chip<br />
is proportional to <strong>the</strong> growth <strong>of</strong> <strong>the</strong> square root <strong>of</strong> its complexity. This rule implies for<br />
instance that doubling <strong>the</strong> processor complexity results only in a performance growth <strong>of</strong><br />
40 %.<br />
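As a minimal numeric sketch (values chosen for illustration, not taken from the paper), Pollack's Rule can be stated as performance being proportional to the square root of the complexity:

```python
import math

def pollack_speedup(complexity_ratio: float) -> float:
    """Relative performance gain predicted by Pollack's Rule for a
    given growth factor of core complexity (transistor count)."""
    return math.sqrt(complexity_ratio)

# Doubling the complexity yields only ~41 % more performance.
print(round(pollack_speedup(2.0), 2))  # → 1.41
```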
1.2.3 Amdahl’s Law

One important question when analyzing processor performance is the speedup achieved by a new processor generation. To this end, Amdahl formulated a very general rule [1] in 1967 that enables us to compare two processor generations. According to Amdahl, the speedup of a system is

Speedup = 1 / ( (1 − f) + f / S )    (1)

where f represents the fraction that is optimized by an improved system, e.g. the affected parts of the code, and S represents the speedup of this fraction. We will see some corollaries of Amdahl’s Law, fitted to multicore processors, in Section 2.3.1.
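As a small worked example (the inputs are chosen for illustration, not taken from the paper), equation 1 can be evaluated directly; note how the serial fraction bounds the total speedup no matter how large S becomes:

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the work is
    accelerated by a factor s (equation 1)."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 90 % of the work by 8x gives less than 5x overall.
print(round(amdahl_speedup(0.9, 8), 2))     # → 4.71

# Even with a near-infinite S, the serial 10 % caps the speedup at 10.
print(round(amdahl_speedup(0.9, 1e12), 2))  # → 10.0
```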
1.3 Multicore Topologies

We consider different types of processors for the following analysis, which are also presented by Esmaeilzadeh et al. [7]. First, we distinguish between regular multicore processors and GPU-like processors, which execute many threads per core. For each of these types, we consider the following topologies.
1.3.1 Symmetric Multicore

A symmetric multicore processor is the most obvious design and consists of multiple identical cores. The parallel fraction of a program is distributed across all of these cores. Serial code, in contrast, is executed on one single core, during which large parts of the processor may be unused.
1.3.2 Asymmetric Multicore

This kind of multiprocessor consists of one large core and multiple small cores of the same type. Typically the performance of the large core is much higher than that of the small cores; thus sequential tasks can be executed with good performance on the large core, while parallel tasks run on the small cores and the large core together.
1.3.3 Dynamic Multicore<br />
The dynamic multicore topology is very similar to <strong>the</strong> asymmetric multicore topology.<br />
Contrary to <strong>the</strong> asymmetric multicore, ei<strong>the</strong>r <strong>the</strong> large core or <strong>the</strong> small cores are usable<br />
at <strong>the</strong> same time. During <strong>the</strong> execution <strong>of</strong> a sequential task, <strong>the</strong> small cores are shut down<br />
and during <strong>the</strong> execution <strong>of</strong> parallel tasks, <strong>the</strong> large core is shut down.<br />
1.3.4 Composed Multicore<br />
The composed multicore topology, in literature also called fused multicore, consists <strong>of</strong><br />
multiple small cores, which can be composed to one large core. This architecture implies<br />
<strong>the</strong> same behavior as <strong>the</strong> dynamic multicore topology where ei<strong>the</strong>r one large core or<br />
multiple small cores can be used at <strong>the</strong> same time.<br />
2 Performance Models

This section describes three models used for estimating the upcoming performance scaling. We model future devices, CPU cores, and multicore CPUs, and combine them to make predictions on future compute performance and the impact of dark silicon. The device model describes upcoming semiconductor technologies. On top of it, a core model estimates the upcoming performance per core from the performance per die area and the power consumption of current processors; in combination with the device model, it yields the core performance of upcoming processors. In the last step, we estimate the upcoming multicore speedup by combining the results from the core model with Amdahl’s Law and with a second, more realistic model.
2.1 Device Model

The authors of [7] presented two different device scaling models. The first one is based on the ITRS technology roadmap¹; the second is a more conservative model presented by Borkar [5]. Both models provide a roadmap of upcoming technologies, which is the basis for the further predictions made in the remainder of this section: for feature sizes from 45 nm down to 8 nm, they give the expected frequency, voltage, capacitance, and power scaling factors. The results of both roadmaps are shown in Figure 1. We have to keep in mind that the ITRS roadmap assumes different types of transistors than the conservative projection.
Figure 1: Scaling factors for ITRS and conservative projections [7]

¹ Online at http://www.itrs.net
2.2 Core Model<br />
2.2.1 Current Performance Behavior<br />
Esmaeilzadeh et al. used empirical performance data, measured with the SPECmark benchmarks, of 152 real processors from 600 nm down to 45 nm. The benchmark results, shown in Figure 2, were taken from the SPEC website². They relate the single-threaded core performance, called q, to the power consumption P(q) and the chip area A(q); details of the processor and system architecture are not considered in this model. The performance q is given as the SPEC CPU2006 score. The power consumption of a processor core was taken from the data sheets. The Thermal Design Power (TDP) was used in this study, i.e. the power a processor can dissipate without exceeding the junction temperature of its transistors. To build a model that predicts upcoming performance, only one technology generation, in this case 45 nm, was considered (Figure 3). To estimate the core area, die photos were used; the area consumed by level 2 and level 3 caches was excluded.
Power and area constraints were considered decoupled in this study. Previous studies on multicore performance used Pollack’s Rule and assumed power consumption to be proportional to the number of transistors, which means proportional to the chip area when only one feature size is considered. Given that frequency and voltage no longer scale as they did historically, Pollack’s Rule is not practical for modeling the power consumption of a current or upcoming processor core.
Figure 2: Power/Performance across nodes [7]

2.2.2 Estimate Optimal Design Points
To identify the most relevant design points, the Pareto frontier of the 45 nm design space was derived. For the power/performance design space, a cubic polynomial P(q) was assumed; the Pareto frontier of the area/performance design space A(q) was assumed to be a quadratic polynomial. This choice was made according to Pollack’s Rule, which implies a quadratic increase of chip area with a performance increase. The coefficients of the polynomials P(q) and A(q) were fitted using least-squares regression. The results are presented in Figure 3 and Figure 4.

² Online at http://www.spec.org

Figure 3: Power/Performance frontier, 45 nm [7]

Figure 4: Area/Performance frontier, 45 nm [7]
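As an illustration of this fitting step, the two frontier polynomials can be obtained by least-squares regression. The data points below are synthetic placeholders that lie exactly on a cubic and a quadratic; the measured SPECmark values used in the study are given in [7]:

```python
import numpy as np

# Hypothetical (performance, power, area) design points standing in
# for the 45 nm frontier data (power in W, area in mm^2).
q     = np.array([ 5.0, 10.0, 15.0, 20.0, 25.0])
power = np.array([ 0.5,  4.0, 13.5, 32.0, 62.5])
area  = np.array([ 2.5,  7.0, 14.5, 25.0, 38.5])

# Cubic power/performance frontier P(q) and quadratic
# area/performance frontier A(q), both fitted by least squares.
P = np.poly1d(np.polyfit(q, power, deg=3))
A = np.poly1d(np.polyfit(q, area, deg=2))

# Evaluate the fitted frontiers at an intermediate design point.
print(round(float(P(12.0)), 3), round(float(A(12.0)), 3))
```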
2.2.3 Predicting Upcoming Performance

To make predictions on upcoming processor core performance, we combine the results of the presented device model and the core model: the 45 nm Pareto frontier was scaled down to 8 nm, yielding a new Pareto frontier for each technology generation, by applying the scaling factors from the device model (Section 2.1) to the data points of the core model. The SPECmark performance is thereby assumed to scale with the frequency, which ignores aspects like memory latency and bandwidth; the presented model therefore has to be considered as an upper bound for upcoming processor performance. The predictions, based on the ITRS roadmap and on the conservative model by Borkar, are shown in Figure 5 and Figure 6.
Figure 5: Conservative frontier scaling [7]

Figure 6: ITRS frontier scaling [7]

2.3 Multicore Model
The next model estimates the possible scaling of multicore processors. We will consider two different scaling models: the first is a corollary of Amdahl’s Law; the second is a more realistic model, originally proposed by Guz et al. [8] and extended in [7]. The latter is applicable to both CPU- and GPU-like processors.
2.3.1 Upper Bound by Amdahl’s Law

To apply Amdahl’s Law to multicore processors, Hill and Marty [10] derived the speedup of all presented multicore topologies. This model can be considered as an upper bound for the multicore speedup. The model was extended to consider power and area constraints, but it does not differentiate between CPU- and GPU-like processor architectures. The possible speedups, depending on the processor topology, are given by equations 2 to 9, where the possible number of cores depends on the chip area and power restrictions. DIE_AREA denotes the maximum area budget and TDP the power budget. The parameter q denotes the performance of a single core; the speedup is measured relative to a baseline core with performance qBaseline. The speedup of a single core cannot be larger than SU(q) = q/qBaseline.
For the symmetric multicore topology, the parallel fraction f of the code is distributed over all NSym available cores; the serial fraction runs on only one core.
NSym(q) = min( DIE_AREA / A(q), TDP / P(q) )    (2)

SpeedupSym(f, q) = 1 / ( (1 − f) / SU(q) + f / (NSym(q) · SU(q)) )    (3)
For the asymmetric multicore topology, the large core dominates the area constraint and the small cores dominate the power constraint. The variables qL and qS describe the performance of the large core and of a single small core, respectively. On this topology, parallel code is executed on the large core and the small cores; sequential code is executed only on the large core.
NAsym(qL, qS) = min( (DIE_AREA − A(qL)) / A(qS), (TDP − P(qL)) / P(qS) )    (4)

SpeedupAsym(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NAsym(qL, qS) · SU(qS) + SU(qL)) )    (5)
With a dynamic multicore topology, the area is still bounded by the area of the large core if the area constraint is the dominating part, but the number of small cores is not limited by the power consumption of the large core. For this topology, parallel code is executed only on the small cores.
NDyn(qL, qS) = min( (DIE_AREA − A(qL)) / A(qS), TDP / P(qS) )    (6)

SpeedupDyn(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NDyn(qL, qS) · SU(qS)) )    (7)
One characteristic of the composed multicore topology is an area overhead caused by the composition technology, described by the parameter τ. The model assumes that the composed core has the same performance and power consumption as a scaled-up single core. The execution behavior of parallel and sequential code is the same as for the dynamic multicore.
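The four constrained topology models of equations 2 to 9 can be sketched directly in Python. The power and area polynomials, the budgets, and the baseline performance below are hypothetical placeholders, not the fitted Pareto frontiers from [7]:

```python
# Area and power budgets roughly in the ballpark of the study's
# assumptions; P(q), A(q) and the baseline are placeholders.
DIE_AREA, TDP, Q_BASELINE, TAU = 111.0, 125.0, 10.0, 0.1

def P(q): return 0.01 * q**3        # power model, W (placeholder)
def A(q): return 0.05 * q**2        # area model, mm^2 (placeholder)
def S_U(q): return q / Q_BASELINE   # single-core speedup bound

def n_sym(q):                       # equation 2
    return min(DIE_AREA / A(q), TDP / P(q))

def speedup_sym(f, q):              # equation 3
    return 1.0 / ((1 - f) / S_U(q) + f / (n_sym(q) * S_U(q)))

def n_asym(qL, qS):                 # equation 4
    return min((DIE_AREA - A(qL)) / A(qS), (TDP - P(qL)) / P(qS))

def speedup_asym(f, qL, qS):        # equation 5
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_asym(qL, qS) * S_U(qS) + S_U(qL)))

def n_dyn(qL, qS):                  # equation 6: large core powered off in parallel phases
    return min((DIE_AREA - A(qL)) / A(qS), TDP / P(qS))

def speedup_dyn(f, qL, qS):         # equation 7
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_dyn(qL, qS) * S_U(qS)))

def n_composed(qL, qS):             # equation 8: area overhead factor (1 + tau)
    return min(DIE_AREA / ((1 + TAU) * A(qS)), (TDP - P(qL)) / P(qS))

def speedup_composed(f, qL, qS):    # equation 9
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_composed(qL, qS) * S_U(qS)))

print(round(speedup_sym(0.9, 10.0), 2))  # → 5.81
```

For these placeholder models, the symmetric topology is power-limited (12.5 cores fit the TDP but 22 would fit the area), which already illustrates how a power budget can leave die area dark.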
NComposed(qL, qS) = min( DIE_AREA / ((1 + τ) · A(qS)), (TDP − P(qL)) / P(qS) )    (8)

SpeedupComposed(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NComposed(qL, qS) · SU(qS)) )    (9)

2.3.2 Realistic Model
The model presented next is a more realistic model of the speedup of upcoming multicore processors. It also considers technological details such as the number of threads per core (and thereby the difference between CPU- and GPU-like architectures), the cache behavior, the memory bandwidth, the frequency, and the cycles per instruction (CPI). Also important for the performance of a processor is the executed application, whose behavior is characterized by its level of parallelism and its memory access pattern. The performance of a fully parallel application, measured as the number of instructions per second, is given by equation 10.
Perf = min( N · (freq / CPIexe) · η , BWmax / (rm · mL1 · b) )    (10)
Thereby η represents the core utilization, which depends on the memory behavior, rm is the fraction of instructions with memory access, mL1 is the predicted miss rate of the first-level cache, and b is the number of bytes per memory access. The CPIexe value and the frequency were estimated from the presented Pareto frontiers; details on the values are explained in [7].

To model application characteristics, PARSEC applications were considered, based on previous studies [2], [3]. The level of parallelism f, obtained from these studies using Amdahl’s Law, lies between 0.75 and 0.9999, depending on the considered benchmark.
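A sketch of equation 10 follows; all parameter values are hypothetical illustrations, not the fitted values used in the study (those are reported in [7]):

```python
def perf(n_cores, freq_hz, cpi_exe, eta, bw_max, r_m, m_l1, b):
    """Upper bound on instructions/s as the minimum of the
    compute-bound and memory-bandwidth-bound terms (equation 10)."""
    compute_bound = n_cores * (freq_hz / cpi_exe) * eta
    memory_bound = bw_max / (r_m * m_l1 * b)
    return min(compute_bound, memory_bound)

# 16 cores at 3 GHz, CPI 1, 80 % utilization, against a 200 GB/s
# memory system with 30 % memory instructions, a 10 % L1 miss rate,
# and 8-byte accesses: this configuration is compute-bound.
print(perf(16, 3e9, 1.0, 0.8, 200e9, 0.3, 0.1, 8))
```

With many more cores the same memory system becomes the bottleneck, which is exactly how the model captures bandwidth-induced dark silicon.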
Now we compute the serial performance PerfS and the parallel performance PerfP for each type of multicore processor using equation 10. The number of cores N is computed using the topology-dependent equations 2, 4, 6, and 8. We consider a 45 nm Nehalem core as the baseline with performance PerfB and obtain a speedup SSerial = PerfS/PerfB for the serial part of the benchmark and SParallel = PerfP/PerfB for the parallel part. The total speedup is given by equation 11 for each of the topologies.
Speedup = 1 / ( (1 − f) / SSerial + f / SParallel )    (11)

2.4 Combining the Models
In this section we put everything together and predict the performance of an upcoming multicore processor. We assume a power limit of 125 W and an area budget of 111 mm², which corresponds to a Nehalem-based 4-core processor in 45 nm technology, excluding level 2 and level 3 caches. For this prediction, each area/performance design point of the Pareto frontier is considered. Starting from a single core, one core is added per iteration, and the new power consumption and speedup are computed. The speedup is computed both with the upper bound given by Amdahl’s Law and with the more realistic model; the power consumption is computed using the power/performance Pareto frontier. The iteration stops when the power or area limit is reached or when the performance decreases. The difference between the chip area allocated up to this step and the total area budget is the fraction of dark silicon. These steps are repeated for all scaled Pareto frontiers with both multicore performance models, considering GPU- and CPU-like processors; the power and area budgets are kept constant. Detailed results of this model are presented in [7]. Esmaeilzadeh et al. came to the conclusion that, using Amdahl’s Law, the maximum speedup at 8 nm is 11.3 with the conservative device scaling and 59 with the ITRS roadmap. In both cases the typical number of cores is predicted to be smaller than 512. Relying on the ITRS roadmap, they expect that dark silicon will dominate in 2024.
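The iterative search described above can be sketched as follows. The power and area models are again hypothetical placeholders, and the speedup is computed with the Amdahl upper bound only:

```python
# Budgets from the text: 125 W and 111 mm^2 (Nehalem 4-core, 45 nm,
# excluding L2/L3 caches). P(q) and A(q) are placeholder models.
DIE_AREA, TDP = 111.0, 125.0

def P(q): return 0.01 * q**3
def A(q): return 0.05 * q**2

def amdahl(f, n, s_u):
    """Symmetric-multicore corollary of Amdahl's Law (equation 3)."""
    return 1.0 / ((1 - f) / s_u + f / (n * s_u))

def best_core_count(q, f, q_baseline=10.0):
    """Add one core per iteration until the power or area budget is
    exhausted or the speedup stops improving; report the fraction of
    the area budget left dark."""
    s_u = q / q_baseline
    best_n, best_speedup = 0, 0.0
    n = 1
    while n * A(q) <= DIE_AREA and n * P(q) <= TDP:
        s = amdahl(f, n, s_u)
        if s <= best_speedup:   # performance decrease → stop
            break
        best_n, best_speedup = n, s
        n += 1
    dark_fraction = 1.0 - best_n * A(q) / DIE_AREA
    return best_n, best_speedup, dark_fraction

n, s, dark = best_core_count(q=10.0, f=0.9)
print(n, round(s, 2), round(dark, 2))  # → 12 5.71 0.46
```

Under these placeholder models the search is stopped by the power budget after 12 cores, leaving roughly 46 % of the area budget dark.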
3 Scaling Limitations and Dark Silicon

Figure 7: Dark silicon bottleneck relaxation using CPU organization and dynamic topology at 8 nm with ITRS scaling [7]
From the previous observations we know that limited application parallelism and a limited power budget are the main sources of dark silicon. To analyze in more detail which of these factors dominates, we take a closer look at a hypothetical CPU-like processor in 8 nm technology derived from the ITRS roadmap. In the first part of Figure 7 only the power budget is limited. The different curves present the speedup of the different PARSEC benchmarks, normalized to a 45 nm Nehalem
quad-core processor. We consider a parallelism of 75 % to 99 % and assume that programmers can somehow achieve these levels. The markers indicate the parallelism of the current implementations. We notice that most of the benchmarks reach a speedup of only 15, even at a parallelism level of 99 %.
In the second part of Figure 7 we consider a fixed level of parallelism and vary the power budget. We see that eight of twelve benchmarks are accelerated by no more than a factor of ten, even with a practically unlimited power budget.
This analysis shows that the limited level of parallelism is the dominant source of dark silicon, while a varying power budget affects the fraction of dark silicon only marginally.
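This is consistent with Amdahl's Law: even with an unlimited number of cores, the serial fraction caps the achievable speedup. As a worked check (not taken from the paper itself),

```latex
S(f, N) = \frac{1}{(1 - f) + f/N},
\qquad
\lim_{N \to \infty} S(0.99, N) = \frac{1}{1 - 0.99} = 100 .
```

So even the ideal ceiling at 99 % parallelism is a speedup of 100; that most benchmarks reach only about 15 under the realistic core model shows how far below the Amdahl bound the practical limit sits.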
4 Alternative Models
4.1 General Models
Several other studies have been published in the area of performance and scaling predictions, but most of them do not reach the generality and level of detail presented by Esmaeilzadeh et al. [7]. Examples are the corollaries to Amdahl's Law by Hill and Marty [10] or the presentation of many-core architectures by Borkar [4].
4.2 Specialization-Oriented Models
A promising approach to overcome the problems pointed out by this work is the use of custom logic. Chung et al. [6] presented a model that combines traditional processors with custom logic, called unconventional cores (U-cores), implemented as FPGAs or GPGPUs. They came to the conclusion that these technologies are useful when reducing power consumption is a primary goal, but they also require a significant level of application parallelism to work efficiently. Such solutions may help to reduce energy demands in some areas, but since limited parallelism is the most critical source of dark silicon (Section 3), it is doubtful that these technologies are suitable for the majority of applications.
Hempstead, Wei, and Brooks presented a modeling framework for upcoming technology generations called Navigo [9]. They also came to the conclusion that specialization for specific applications may overcome energy problems. However, they made very optimistic assumptions regarding the achievable parallelism, so it remains problematic to solve the dark silicon problem with this approach.
5 Conclusions
Historically, processor speedups were achieved by increasing chip complexity and clock frequency. This scaling failed in recent years due to an exorbitant growth of energy consumption. The answer of computer engineers was multicore processors, which in turn raise many new problems. This work presented an analysis
of the performance scaling of multicore CPUs and GPUs with a focus on the effect of dark silicon. Three models were presented: a device model that predicts upcoming semiconductor technologies, a core model that predicts upcoming single-core performance, and a multicore model that enables predictions of the speedup achievable with multicore processors. We have seen that even with the optimistic technology scaling proposed by the ITRS roadmap, it is impossible to sustain the historical performance growth.
Finally, we have to consider the significance of this work. The relevant factor here is the plausibility of the assumptions made and the techniques used. To simplify the analysis, the proposed models do not consider simultaneous multithreading (SMT). SMT may yield an additional speedup, but it may also be a performance drawback.
Another problem is that only on-chip components were considered in the power analysis. There is a consensus that the power share of the remaining system components will increase in the future: these components will demand a larger part of the total power consumption, which may reduce the speedup and increase the fraction of dark silicon.
The presented empirical data contained only Intel and AMD processors; in particular, ARM and Tilera cores were not considered due to missing SPECmark results.
However, the presented model seems feasible in general, even though some smaller assumptions in different sections of the study were optimistic. In particular, the identified sources of dark silicon appear realistic. The fact that limited application parallelism is the most important reason for dark silicon shows that programmers, too, bear a large part of the upcoming challenge of speeding up applications.
References
[1] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.
[2] Major Bhadauria, Vincent M. Weaver, and Sally A. McKee. Understanding PARSEC performance on contemporary CMPs. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC '09, pages 98–107, Washington, DC, USA, 2009. IEEE Computer Society.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, New York, NY, USA, 2008. ACM.
[4] Shekhar Borkar. Thousand Core Chips: A Technology Perspective. In 2007 44th ACM/IEEE Design Automation Conference, pages 746–749. IEEE, June 2007.
[5] Shekhar Borkar. The Exascale Challenge. In Proceedings of 2010 International Symposium on VLSI Design, Automation and Test, pages 2–3, April 2010.
[6] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 225–236, December 2010.
[7] Hadi Esmaeilzadeh, Emily Blem, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, New York, NY, USA, 2011. ACM.
[8] Zvika Guz, Evgeny Bolotin, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser. Many-core vs. many-thread machines: Stay away from the valley. IEEE Computer Architecture Letters, 8:25–28, January 2009.
[9] Mark Hempstead, Gu-Yeon Wei, and David Brooks. Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization. 2009.
[10] Mark D. Hill and Michael R. Marty. Amdahl's Law in the Multicore Era. Computer, 41(7):33–38, July 2008.
[11] Gordon E. Moore. Cramming more components onto integrated circuits. Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff. IEEE Solid-State Circuits Newsletter, 20(3):33–35, September 2006.
Guiding Computation Accelerators to Performance Optimization Dynamically
Sandeep Korrapati
University of Paderborn
sandeep@uni-paderborn.de
January 13, 2012
Abstract
The constant demand for performance optimization and increased computational efficiency has led to many advancements in the design of embedded processors. The use of application-specific instruction set processors (ASIPs) is one of the most popular approaches. The computation accelerators used in ASIPs are hardware units customized according to instruction set extensions (ISEs). To capitalize on the performance gain provided by these customized accelerators, applications have to be compiled with these ISEs. This paper explains in detail (1) an approach to dynamically utilize these customized accelerators for applications that are not compiled with the ISEs, (2) the problems faced due to the dynamic approach, and (3) the methods used to resolve them.
1 Introduction
1.1 Introduction to Terminology
Compilation of an application involves decoding the instructions and storing them in a convenient form so that they can easily be referenced later. The compiler views these decoded instructions as a graph, referred to as a dataflow graph (DFG). Portions of this dataflow graph are often extracted to fuse them into macro-ops or to map them onto specialized hardware; these portions of the DFG are referred to as subgraphs. The compiler also requires a description of the flow of control within these instructions. Hence, it extracts a graph depicting the flow of control from the DFG, referred to as a control-flow graph (CFG).
1.2 Origin
Present-day embedded systems are expected to efficiently perform complex computations such as processing images, signals, and video streams. General-purpose processors may
fail to meet the demands of such complex computations in terms of performance and power costs. Customizing hardware is a common method of meeting these performance requirements within limited power and cost constraints. Traditionally, application-specific integrated circuits (ASICs) are used in embedded systems to perform computation-intensive tasks. ASICs are non-programmable hardware customizations that aid in realizing efficient solutions. In ASICs, the critical functionality is mapped directly onto hardware implementations, reducing the burden on the processor and thereby resulting in better performance. Although ASICs yield better performance than other solutions, their lack of programmability makes them a poor choice, as only few applications can fully benefit from them. Any change in the application may deprive it of the advantages of the ASIC. Moreover, the introduction of an ASIC requires rewriting the application to be able to take advantage of it.
An alternative approach is to employ smaller, but compilable, hardware units referred to as computation accelerators. These accelerators are customized for certain specific complex operations, and the instruction set must incorporate the corresponding instructions. Application-specific instruction set processors (ASIPs) utilize computation accelerators incorporated into their processor pipeline. Computation accelerators can provide several advantages, including reduced latency for subgraph execution, increased execution bandwidth, improved utilization of pipeline resources, and a reduced burden on the register file for storing temporary values. ASIPs, unlike ASICs, are reprogrammable, have a time-to-market advantage over ASICs, and provide better performance than traditional general-purpose processors.
The multiply-accumulate (MAC) unit is one of the most widely used accelerators in industry. Accelerators are commonly used in DSPs, where frequent computations from signal and image processing, such as dot product, sum of absolute differences, and compare-select, are mapped onto them. Accelerators are further classified into two types: generalized and specialized accelerators. The design of generalized accelerators is mainly architecture dependent; examples are 3-1 ALUs, closed-loop ALUs, etc. The larger the accelerator, the bigger the subgraphs it can support and thus the higher the performance enhancement. However, increasing the capacity of an accelerator narrows its deployability, as fewer applications can benefit from it. FPGA-style accelerators, configurable compute accelerators, and programmable carry functions are some of the successful larger accelerators. As the name suggests, specialized accelerators target a particular application. Such synthesized accelerators are mostly employed in commercial tool chains, e.g. Tensilica Xtensa, ARC Architect, and ARM OptimoDE.
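To illustrate the idea of collapsing a subgraph into one accelerator operation, the following minimal sketch (hypothetical, not from the paper) contrasts a dot product executed as separate multiply and add micro-ops with the same computation expressed through a fused MAC operation:

```python
def dot_product_scalar(xs, ys):
    """Dot product with separate operations: each iteration issues a
    multiply micro-op followed by a dependent add micro-op."""
    acc = 0
    for x, y in zip(xs, ys):
        p = x * y       # multiply micro-op
        acc = acc + p   # dependent add micro-op
    return acc

def mac(acc, x, y):
    """Model of a fused multiply-accumulate accelerator operation:
    one instruction instead of a dependent mul/add pair."""
    return acc + x * y

def dot_product_mac(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)  # one accelerator instruction per iteration
    return acc
```

Both variants compute the same result; the accelerator version replaces a two-node multiply/add subgraph in the DFG with a single instruction per iteration.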
Complex algorithms have been developed over time to identify the subgraphs that can be executed on the accelerators. These algorithms require instruction set extensions (ISEs) for the instructions supported by the accelerators in order to select the subgraphs; the control-flow graph is then used to isolate the subset of subgraphs that would improve overall performance. Usually these algorithms are incorporated into the compilation process, making the approach static. Hence, applications that are not compiled with these ISEs face a binary compatibility problem and cannot benefit from these accelerators. The authors have proposed a dynamic binary translation (DBT) approach to overcome binary incompatibility: it enables applications that were not compiled with these ISEs to benefit from the computation accelerators as well.
In principle, dynamic binary translation looks at a short sequence of code, typically on the order of a single basic block, translates it, and caches the resulting sequence. Code is only translated as it is discovered and where possible. The translation-time overhead can be amortized if translated code sequences are executed multiple times.
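The translate-and-cache principle can be sketched as follows; this is a generic illustration with hypothetical `fetch_block`, `translate`, and `run` callbacks, not a description of any particular DBT system:

```python
# Minimal sketch of a lazy translate-and-cache loop as used in dynamic
# binary translation: a basic block is translated on first execution,
# and the cached translation is reused afterwards, amortizing the
# one-time translation cost over repeated executions.

translation_cache = {}  # start address of basic block -> translated code

def execute(pc, fetch_block, translate, run):
    """Execute the basic block at address `pc`, translating it on first
    encounter and reusing the cached translation on later ones."""
    if pc not in translation_cache:
        block = fetch_block(pc)                   # discover code lazily
        translation_cache[pc] = translate(block)  # pay translation cost once
    return run(translation_cache[pc])             # reuse cached translation
```

Repeated calls with the same `pc` hit the cache, so `fetch_block` and `translate` run only once per block.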
Dynamic binary translation has proven effective in embedded systems for tasks such as power management, security, software caches, instruction set translation, and memory management. The authors use this technique to collapse critical computation subgraphs into ISEs at runtime, thereby mapping them onto the accelerators without the need to recompile. As this processing has to be done at runtime, it poses certain limitations. The authors describe their implementation using a dynamic binary translator, the difficulties in achieving it, and the methods used to overcome them.
In this document, the work done in [3] is explained in detail. Section 2 discusses similar work done to improve performance. Section 3 describes the methodology of the algorithms employed in static approaches. Section 4 explains a similar implementation to give a better understanding of the work done in [3], and Section 5 explains the implementation used in [3]. Finally, Section 6 concludes the work with an overview.
2 Related Work
Attempts to improve the performance of embedded systems have been made in many areas. Most research has been in the field of automating the generation of ISEs. Whenever a new accelerator is developed, or an existing accelerator is modified, an ISE suitable for the hardware must also be developed. The development of this ISE has to be monitored and tested well enough to guarantee the full benefits of the hardware. Automating the generation of the ISE avoids the time that would otherwise be invested in its design and testing, thereby allowing an earlier release of the product to market.
There has also been research on the hardware structure of accelerators. One example is an attempt to serialize register file accesses to increase the effective number of register file ports. Another is a flexible configurable compute accelerator that can be integrated into a pre-designed processor core through a simple interface.
Next is the usage of an accelerator: as described in Section 1, most other approaches are static. The identification of subgraphs and their mapping onto the accelerator is done during compilation, along with generating the ISEs. Some research also covers dynamic hardware approaches designed for trace-based systems.
The research most closely related to that of the authors is explained in [1]. It involves fusing dependent micro-ops into macro-ops to run on 3-1 ALUs, thereby increasing instruction-level parallelism. One limitation of this approach is that it focuses on only a specific architecture. It is a co-designed virtual machine approach with an enhanced superscalar microarchitecture and is explained in detail in Section 4.
3 Static Approach
The standard implementations of ASIPs incorporate accelerator support into the compilation. Hence, the performance of accelerators in these ASIPs greatly depends on compiler support. The compiler has two major tasks when targeting a computation accelerator. First, it must identify the candidate subgraphs in the target application that can be executed on the accelerator. This task, commonly known as subgraph isomorphism, gets complicated when an accelerator supports multiple functionalities, especially when some of them are supersets of others. The second task is to select which of these candidate subgraphs will actually be executed on the accelerator. Candidates often overlap, so the compiler must select a subset of them in order to maximize the performance gain.
For the compiler to be able to identify these subgraphs, the instructions supported by the accelerator have to be incorporated into the instruction set, i.e., an instruction set extension (ISE) has to be designed to match the accelerator. When an application is compiled with these ISEs, the subgraphs that can be executed by the accelerators are identified and replaced with suitable instructions that invoke an accelerator.
Initially, a greedy compiler approach was common. In this approach an operation (referred to as the seed) is selected and expanded as long as the result remains compatible with the accelerator. However, this approach produces only a sub-optimal solution and mostly breaks down for larger accelerators. There has been a lot of research in this area, and better, more complex algorithms have been developed.
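A minimal sketch of such a greedy seed expansion follows. It is an illustration, not a compiler implementation: the DFG is a hypothetical adjacency map, and the `is_supported` predicate stands in for the accelerator's constraints (size, operand count, supported operations).

```python
# Hypothetical greedy subgraph identification: starting from a seed
# operation, neighbors in the dataflow graph (DFG) are absorbed one at a
# time, as long as the grown subgraph is still supported by the
# accelerator. `dfg` maps each operation to its neighboring operations.

def grow_subgraph(dfg, seed, is_supported):
    subgraph = {seed}
    grown = True
    while grown:
        grown = False
        # candidate operations adjacent to the current subgraph
        frontier = {n for op in subgraph for n in dfg[op]} - subgraph
        for op in sorted(frontier):
            if is_supported(subgraph | {op}):
                subgraph.add(op)   # greedily absorb the operation
                grown = True
                break              # re-examine the enlarged frontier
    return subgraph
```

The sketch also shows why the approach is sub-optimal: expansion stops at the first infeasible neighbor ordering and never backtracks, which is exactly the failure mode on larger accelerators.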
Since the identification of subgraphs and the selection of candidate subgraphs for execution on the computation accelerator are done during compilation, the complexity and execution time of the algorithms are not highly restricted. Moreover, the dataflow and control-flow information of these subgraphs is available from the compilation, as the subgraphs are already identified, thereby placing no additional burden on execution. The availability of control-flow information eases the scheduling of the instructions and helps avoid conflicts.
4 Dynamic Approach for CISC Processors
The authors of [1] describe a dynamic approach to improve the performance of a traditional x86 processor using an enhanced superscalar microarchitecture and a layer of concealed dynamic binary translation software that is co-designed with the hardware. The main concept behind the proposed optimization is to combine dependent micro-op pairs into fused "macro-ops" that are managed throughout the pipeline as single entities. The authors state that, although a CISC instruction set architecture (ISA) already has instructions that are essentially fused micro-ops, higher efficiency and performance can be achieved by first cracking the CISC instructions and then rearranging and fusing them into different combinations than in the original code.
Figure 1: Overview of the proposed x86 design in [1]
The proposed implementation contains two major components: the software binary translator and the supporting hardware architecture. The interface between the two is the x86-specific implementation instruction set. A two-level decoder has been introduced as part of the proposed architecture. The decoder first translates the x86 instructions into micro-ops; the second decode level generates the decoded control signals used by the pipeline.
The pipeline is designed to have two modes: one to process the x86 instructions (x86 mode) and the other for fused macro-ops (macro-op mode). Profiling hardware is used to identify frequently used code regions (hotspots). As hotspots are discovered, they are organized into special blocks called superblocks, then translated and optimized as fused macro-ops. These fused macro-ops are placed into a concealed code cache. To reduce pipeline complexity, fusing is performed only for dependent micro-op pairs that have a combined total of two or fewer unique input register operands. When these macro-ops are executed, the first decode level shown in Figure 1 is bypassed; they only pass through the second decode level.
The dynamic binary translation software optimizes these hotspots by finding critical micro-op pairs for fusing: it analyzes the overall micro-ops, reorders them, and fuses pairs of operations taken from different x86 instructions. In the optimized macro-op code, paired dependent micro-ops are placed in adjacent memory locations and are identified via a special fuse bit. Two main strategies are used for fusing. First, single-cycle micro-ops are given higher priority as the head of a pair. Second, higher priority is given to pairing micro-ops that are close together in the original x86 code sequence. The reason is that these pairs are more likely to be on the program's critical path and should be scheduled for fused execution in order to reduce the critical-path latency. Another constraint is that the order of memory operations has to be maintained.
Algorithm Functionality
Figure 2: The two-pass algorithm used in [1]
Figure 3: Example of the two-pass algorithm from [1]
A forward two-pass scan algorithm is used to create fused macro-ops quickly and effectively. Once a data dependence graph has been created, the first pass considers single-cycle micro-ops one by one as tail candidates. For each tail candidate, the algorithm looks backward in the micro-op stream to find a head for it, scanning from the second micro-op in backward order to the last one (i.e., the first of the actual stream) in the block containing the translated code (the superblock). The constraints are to find the nearest preceding micro-op as head; this micro-op must be single-cycle and, most importantly, must produce one of the tail candidate's input operands. The fusing rules also favor dependent pairs with a condition code dependence. The pairs that have
satisfied the above conditions then go through a set of fusing tests. These tests make sure that no fused macro-op has more than two distinct source operands, breaks any dependence in the original code, or breaks memory ordering. Macro-ops with more than two source operands become an overhead on the pipeline, inducing more latency than the performance gain obtained. Clearly, breaking a dependence in the original code would produce incorrect results. Furthermore, the memory ordering hardware can be kept simple if the memory ordering is not broken while fusing the operations.
This concludes the first scan. In the second scan, the multi-cycle micro-ops are considered as candidate tails, and the same steps are run again to detect whether a suitable head can be located in the superblock.
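The two passes can be sketched as follows. This is a strongly simplified model, not the implementation from [1]: micro-ops are plain records with hypothetical fields, and only the single-cycle-head and two-operand constraints are modeled (the dependence-breaking and memory-ordering tests are omitted).

```python
# Hypothetical model of the two-pass fusing scan. A micro-op is a dict
# with a destination register, source registers, and a latency class.
# Pass 1 tries single-cycle tails, pass 2 multi-cycle tails; for each
# tail the nearest preceding single-cycle producer of one of its inputs
# is taken as head, if the pair has at most two distinct source operands.

def two_pass_fuse(uops):
    pairs = []    # list of (head_index, tail_index)
    used = set()  # indices already belonging to a fused pair

    def scan(tail_pred):
        for t, tail in enumerate(uops):
            if t in used or not tail_pred(tail):
                continue
            # look backward for the nearest preceding candidate head
            for h in range(t - 1, -1, -1):
                head = uops[h]
                if h in used or head["cycles"] != 1:
                    continue  # head must be a free single-cycle micro-op
                if head["dst"] not in tail["srcs"]:
                    continue  # head must produce one of tail's inputs
                # fusing test: at most two distinct source operands overall
                srcs = set(head["srcs"]) | (set(tail["srcs"]) - {head["dst"]})
                if len(srcs) <= 2:
                    pairs.append((h, t))
                    used.update({h, t})
                break  # only the nearest producing head is considered

    scan(lambda u: u["cycles"] == 1)  # pass 1: single-cycle tails
    scan(lambda u: u["cycles"] > 1)   # pass 2: multi-cycle tails
    return pairs
```

Note how the structure enforces the constraint discussed below: a multi-cycle micro-op may become a tail in the second pass, but the head is always single-cycle.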
Figure 3 illustrates a good example, showing how x86 code is decoded into micro-ops and how dependent pairs are then fused into macro-ops. The translator first cracks the x86 operations into micro-ops, as depicted in Figure 3b. Reax denotes the native register to which the x86 eax register is mapped. The long immediate 080b8658 is allocated to register R18, as it is used often. First, a dependence graph is
Guiding Computation Accelerators to Performance Optimization Dynamically
built for the translated instructions. Then the two-pass fusing algorithm looks for pairs of dependent single-cycle ALU micro-ops during the first scan. In the current example, the AND and the first ADD are fused (marked by :: in Figure 3c). The fused pair causes a reordering of the instructions: the AND operation, moved up, would overwrite the value of Reax that the store operation still needs. Register assignment is used to resolve such issues; here R20 is assigned to hold the value from the ADD operation so that it can be used by both the AND and the ST operation. As the fusing algorithm also considers multi-cycle micro-ops as candidate tails during the second pass, the last two dependent micro-ops are fused together. Even though the tail is a multi-cycle micro-op, the head remains a single-cycle micro-op, a constraint this algorithm enforces.
The two-pass algorithm described here proves more advantageous than the single-pass algorithm used in [2]. The single-pass algorithm would aggressively fuse the first ADD with the following ST operation, even though that pair is not on the critical path. Using memory instructions as tails may also slow down the wakeup of the entire pair, losing cycles when the head micro-op is critical for another dependent micro-op. Although the two-pass algorithm comes with slightly higher translation overhead and fewer fused micro-ops overall, the generated code runs significantly faster in pipelined issue logic.
Observation

A co-designed virtual machine paradigm is applied to improve the efficiency and performance of an x86 processor. With cost-effective hardware support and co-designed runtime software optimizers, the VM approach achieves higher performance in macro-op mode with minimal performance loss in x86 mode during startup. It optimizes the many micro-ops generated by the translator from the x86 code and is applicable to CISC processors in general. The proposed implementation improves x86 IPC performance by 20% on average over a comparable conventional superscalar design. The large performance gain comes from macro-op fusing, which treats fused micro-ops as single entities throughout the pipeline to improve instruction-level parallelism (ILP) and reduces communication and management overhead. Other features, such as superblock code re-layout, a shorter decode pipeline for optimized hotspot code (the first-level decoder is skipped), and the use of a 3-1 ALU (which reduces latency for some branches and loads), also contribute to the performance improvement. This implementation is a promising approach to the thorny and challenging issues present in CISC ISAs such as the x86.
5 Dynamic Optimization for Computation Accelerators

The authors of [3] have proposed an approach to dynamically optimize the utilization of computation accelerators. It is more generic than the approach
discussed in [1], which focuses mainly on CISC processors. Another significant feature of this approach is that it is a purely software-oriented optimization. The authors describe the techniques used to incorporate accelerator utilization into dynamic binary translation, overcoming the binary compatibility problems posed by not compiling applications with the ISEs. Because it is applied at runtime, the implementation faces certain limitations; the methods used to overcome them are also explained here.
5.1 Integration

The accelerator utilization process is integrated into a dynamic binary translation system by introducing the authors' optimization technique between the trace-formation and superblock-cache modules.
The basic flow of a dynamic binary translation system consists of three stages and a manager module responsible for high-level control. In the first stage, instructions are interpreted and emulated, and hotspot regions are searched for during emulation. If a hotspot region is identified, it is forwarded to the trace formation stage, where the translator continuously translates instructions until the stopping conditions are met. The translated instructions are formed into large blocks called superblocks. These superblocks undergo several optimization techniques, and the optimized code is placed into a cache called the superblock cache, whose code blocks are indexed through an address map table. After an initial warmup, once some optimized blocks have been placed into the superblock cache, each instruction to be interpreted is first looked up in the cache to check whether a suitable mapping is already present. On a hit, the code is fetched from the cache and executed; on a miss, the instruction is passed to the interpretation stage and the flow continues.
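As a toy illustration of this warmup behavior (not the authors' system), the dispatch between interpreter and superblock cache can be sketched as follows; the hot-threshold value and all names are invented:

```python
def simulate_dispatch(block_addresses, hot_threshold=3):
    """Dispatch a stream of basic-block addresses the way a DBT manager
    would: interpret and profile until a block becomes hot, then place
    its optimized superblock into the cache and serve later executions
    from there. Returns the number of superblock-cache hits."""
    superblock_cache = set()   # stands in for the address map table
    exec_counts = {}
    hits = 0
    for addr in block_addresses:
        if addr in superblock_cache:
            hits += 1          # hit: fetch optimized code from the cache
            continue
        # miss: interpret/emulate the block and update its hotness counter
        exec_counts[addr] = exec_counts.get(addr, 0) + 1
        if exec_counts[addr] >= hot_threshold:
            # hotspot detected: trace formation, optimization, cache insert
            superblock_cache.add(addr)
    return hits
```

For a block executed five times with a threshold of three, the first three executions are interpreted and the last two are served from the superblock cache.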
The accelerator utilization process proposed by the authors of [3] is incorporated as one of the optimization techniques in the optimization stage (indicated as the gray part of Figure 4). It is regarded as a special kind of instruction-set-specific optimization. Apart from it, only a few other required optimization techniques were used in their implementation, so as to fully measure the performance of their technique. These additional techniques include indirect branch (e.g. jump) removal and superblock chaining (identifying the dependencies among the superblocks and scheduling them appropriately).
Unlike the static approach, which operates on compiled code for which the data-flow and control-flow graphs have already been constructed, the dynamic approach lacks such graphs and therefore faces many problems. Constructing the exact control flow graph at runtime can be time-consuming or even impossible, and without proper control-flow information the dependencies among the data blocks cannot be identified.
The authors therefore concentrate on dataflow analysis and subgraph mapping, using dynamic binary translation to map critical dataflow subgraphs onto ISEs at runtime without any control flow information from compilation.
5.2 Functional Description

Figure 4: A typical DBT workflow from [3]
The main factor to be considered is execution time. In a static approach, dataflow analysis and subgraph mapping are performed on intermediate code with the help of control flow information from the compilation framework, and their cost is therefore not counted toward the application's actual execution time. In a dynamic approach, by contrast, dataflow analysis and subgraph mapping are performed on the final binary code, and without any control flow information. Because they run at runtime, the complexity of the algorithms has to be kept in check, since their execution time is counted into the actual execution time of the application. A further constraint of working on the final binary is that the number of intermediate variables is limited to the architecture registers. Although limited in number, the use of these registers offers extra benefits. Sections 5.3 and 5.4 explain the major functionalities in detail.
5.3 Dataflow Analysis

Dataflow analysis is an important prerequisite for compiler optimizations. The identified dataflow graphs are mapped onto the accelerators, where they can be executed efficiently, to increase performance. The dataflow analysis is split into two parts: (1) intra-block dataflow analysis, which identifies the dependent instructions within a superblock, and (2) inter-block dataflow analysis, which avoids unsafe code transformations that could be caused by live-out registers of one block being used in another.

Obtaining complete dataflow information at runtime is not a good option, as it could take too long and in turn affect overall performance. Hence, in the current implementation the dataflow is analyzed block by block.
5.3.1 Intra-block Dataflow Analysis

The usual algorithm used to build a dataflow graph is a simple brute-force pass over the list of instructions with a nested scan: for each instruction, all previous instructions are checked to see whether the current instruction uses one of their results. If so, a dataflow edge is set from the previous instruction to the current one. This results in an algorithm of O(n²). Moreover, such algorithms typically run on intermediate code before register allocation, so they can use any number of variables to store temporary values.
As dynamic binary translation systems perform this analysis on the final binary form of an application, the number of variables is restricted to the number of architecture registers. However, the use of architecture registers provides an extra benefit, which the authors exploit. Their algorithm maintains an array with one entry per register, storing the number of the instruction that modified that register last. Each instruction has one target register, where the result is stored, and at most two further registers, the source registers, which contain the data required for performing the operation. For each instruction, the source registers are looked up in this array to see whether they were modified by a previous instruction in the current block. If the corresponding entry is not zero, a dataflow edge is set from that instruction to the current one. Thereby the complexity of the algorithm is reduced to O(n). It also proved to be between 68% and 96.82% effective in the benchmarks run by the authors.
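A sketch of this last-writer scheme follows; the instruction tuples and the register count are illustrative, not the paper's data structures:

```python
def build_dataflow_edges(instrs, num_regs=32):
    """Build intra-block dataflow edges in a single O(n) pass using an
    array that records, per register, the last instruction that wrote it.
    `instrs` is a list of (dst_reg, src_regs) tuples; a table entry of 0
    means the register has not been modified in this block yet."""
    last_writer = [0] * num_regs
    edges = []
    for i, (dst, srcs) in enumerate(instrs, start=1):   # 1-based indices
        for s in srcs:
            if last_writer[s] != 0:                     # produced in-block
                edges.append((last_writer[s] - 1, i - 1))
            # else: the value is live-in from a predecessor block
        last_writer[dst] = i                            # record the writer
    return edges
```

Each instruction does only constant work per source register, which is where the reduction from the O(n²) pairwise scan comes from.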
5.3.2 Inter-block Dataflow Analysis

Although the dataflow of a subgraph is contained within the superblock in most cases, subgraphs near block borders have to be handled with care. If a block has live-out nodes (registers written within the block that are still live at its end), they have to be killed in the successor block; otherwise unsafe code transformations may result.
For example, if a target register defined within the current subgraph is not consumed by its end, it is considered a live-out node. Successor blocks using such registers have to be informed of them so that they can redefine these registers before using them. Consider the subgraph surrounded by dashed lines in Figure 5, which corresponds to instructions 1, 3 and 5 of the machine code. From this subgraph it can be seen that register $2 is a live-out register. If the successor subgraphs outside the superblock redefine register $2 before using it, the authors suggest that the subgraph can be ported to a 1-output accelerator; otherwise the accelerator must be at least a 2-output one.
The algorithm proposed by the authors uses register masks to identify these live-out nodes and kill them. The registers used in the block are the input of the algorithm; whenever a register is read, its mask bit is set to zero, indicating that its value has been consumed by the current instruction. If the mask bit of a modified register is still set to one at the end of the block, that register is live-out. The bit mask is passed on to the successor block to notify it of the live-out nodes. If these live-out nodes are killed by the end of the successor block, there is a dependency between the two subgraphs; this dependency information is used during scheduling to avoid unsafe code transformations.
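A hedged reconstruction of the mask bookkeeping (one bit per architecture register; each instruction given as a (dst, srcs) tuple, which is an assumed encoding):

```python
def liveout_mask(instrs):
    """Compute the live-out bit mask for a block: reading a register
    clears its bit (the value was consumed inside the block), writing a
    register sets it. Bits still set at the end mark live-out registers,
    and the mask is handed to the successor block."""
    mask = 0
    for dst, srcs in instrs:
        for s in srcs:
            mask &= ~(1 << s)   # source consumed within this block
        mask |= 1 << dst        # (re)defined: live-out candidate
    return mask
```

For the three-instruction block in the test below, registers $2 and $4 are written but never read again, so exactly their bits remain set.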
Figure 5: An example of inter-block dataflow [3]
Figure 6: Examples of unsafe subgraphs [3]
This algorithm has proven to be 19.9% to 54.51% effective for different applications. The downside is that, being an unrestricted depth-first search, it takes a long time for certain applications. This can be resolved by placing a limit on the maximum search depth.
5.4 Subgraph Mapping

5.4.1 Safety Checking

Now that the dataflow information for the superblock is available, subgraphs have to be identified and formed into ISEs. Subgraph mapping involves (1) collapsing several instructions into an ISE and (2) reordering code to group the dependent instructions. The subgraphs have to be chosen such that the safety of the code remains intact.
Figure 7: An example of subgraphs among blocks [3]
Some unsafe subgraph mappings can be seen in Figure 6. Figures 6(a) and 6(b) show subgraphs with cyclic dependences. The problem in Figure 6(a) is referred to as a non-convex subgraph: a cyclic dependence is formed between operations inside and outside the subgraph. Hence the authors' implementation makes sure the instructions of a subgraph have no such side path. Figure 6(b) shows two subgraphs that could each become an ISE but are interdependent. Such situations are avoided by choosing only one of the subgraphs as an ISE at a time.
Figure 6(c) shows another form of unsafe code transformation: it would be unsafe to place the subgraph at the third instruction, as register $10 is overwritten by the second operation. Hence the placement of the subgraphs has to be chosen carefully.
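The convexity condition of Figure 6(a) can be tested with a simple reachability check. This is a simplified sketch under an assumed edge-list encoding, not the authors' implementation:

```python
from collections import defaultdict, deque

def is_convex(edges, subgraph):
    """Return False if some dataflow path leaves the candidate subgraph
    and re-enters it (the non-convex case of Figure 6(a)). `edges` is a
    list of (producer, consumer) node pairs."""
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
    sub = set(subgraph)
    # start from successors that lie outside the subgraph
    frontier = deque(d for n in sub for d in succ[n] if d not in sub)
    seen = set()
    while frontier:
        n = frontier.popleft()
        if n in seen:
            continue
        seen.add(n)
        if n in sub:
            return False        # a side path re-enters the subgraph
        frontier.extend(succ[n])
    return True
```

In the test below, grouping nodes 0 and 1 is unsafe because node 2 sits on a path from 0 back into the group, while grouping 0 and 2 is convex.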
5.4.2 Subgraph Mapping among Blocks

A further advantage of runtime optimization is that the block boundaries are known. Additionally, a profiler can be used to identify the critical paths, which is not possible in static approaches. Using this information, instructions can be moved among the blocks to form better subgraphs. An example of this can be seen in Figure 7.
5.4.3 Subgraph Mapping Strategy

After the initial checks are done on the subgraphs obtained from the basic blocks, the mapping strategy comes down to two basic steps. First, the subgraphs have to be enumerated to obtain the critical sections that can be executed on an accelerator. Second, a subset of these subgraphs is selected that results in optimal performance. As the mapping has to be done at runtime, the authors have come up with a variant of the greedy approach, which marks the
nodes that have been considered once. An operation is selected as a seed and expanded until a jump in control flow is observed. When selecting a new seed, only unmarked nodes are considered. The subgraphs obtained in this way are then mapped onto the accelerators.
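One way to realize this marking seed-and-grow enumeration is sketched below; the dataflow-graph encoding and branch flags are illustrative assumptions:

```python
def enumerate_subgraphs(dfg, is_branch):
    """Greedy enumeration: take the lowest-numbered unmarked node as a
    seed, grow it along dataflow edges until a control transfer is hit,
    and mark every visited node so later seeds skip it. `dfg` maps a
    node to its dataflow successors; `is_branch` flags control jumps."""
    marked = set()
    subgraphs = []
    for seed in sorted(dfg):
        if seed in marked:
            continue
        group, stack = [], [seed]
        while stack:
            n = stack.pop()
            if n in marked:
                continue
            marked.add(n)
            group.append(n)
            if not is_branch.get(n, False):   # stop expanding at a jump
                stack.extend(dfg.get(n, []))
        subgraphs.append(sorted(group))
    return subgraphs
```

Because every node is marked at most once, the enumeration stays linear in the graph size, which matters given that it runs at translation time.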
6 Conclusion

Most research on improving the performance of accelerators has concerned the hardware (static or dynamic) and the automatic generation of ISEs. The authors of [3], in contrast, have proposed a dynamic approach to utilizing accelerators. Another approach, the co-designed virtual machine paradigm of [1], is also explained here to provide a better understanding of the accelerator workflow. The algorithms proposed in [3] for dataflow analysis and subgraph mapping at runtime using dynamic binary translation have proven to be relatively effective for applications that are not compiled with the ISEs. Although many safety checks must be performed, the use of the architecture registers in the runtime algorithms has paved the way for good results.
References

[1] S. Hu, I. Kim, M. H. Lipasti, and J. E. Smith. An approach for implementing efficient superscalar CISC processors. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1598111&tag=1, February 2006.

[2] S. Hu and J. E. Smith. Using dynamic binary translation to fuse dependent instructions. http://dl.acm.org/citation.cfm?id=977395.977670&coll=DL&dl=ACM&CFID=61907142&CFTOKEN=18787638, March 2004.

[3] Ya-shuai Lü, Li Shen, Zhi-ying Wang, and Nong Xiao. Dynamically utilizing computation accelerators for extensible processors in a software approach. http://dl.acm.org/citation.cfm?doid=1629435.1629443, October 2009.
A Case for Lifetime-Aware Task Mapping in Embedded Chip Multiprocessors

André Koza
University of Paderborn
koza@mail.uni-paderborn.de
January 13, 2012
Abstract

The lifetime of embedded systems is an important reliability factor: unpredicted failures of essential components can become a bottleneck for overall system lifetime. There are different approaches to increasing lifetime. One is to add resources to the system that cover for component failures; another is to change the way existing resources are used. In this seminar paper, three approaches that enhance system lifetime are presented. One focuses on lifetime-cost Pareto-optimal slack allocation, where slack denotes resources that are initially not required but to which tasks and memory of failed components can be remapped. The other two approaches focus on lifetime-aware task mappings, i.e. task mappings whose goal is to improve lifetime. All three approaches increase system lifetime; while slack allocation needs additional investment in hardware, task mapping only needs a change in software.
1 Introduction

Lifetime reliability of embedded chip multiprocessors has become important, as unforeseen system failures can have dramatic consequences, e.g. the failure of a security system in an automobile. System lifetime therefore has to be addressed in the design of the system [6]. Recent strategies either take a system-level approach, in which the hardware or the communication architecture is changed [9], or improve lifetime by changing the way resources are used, e.g. by task mapping [6][7].
In this seminar paper, three recent approaches to improving system lifetime in embedded systems are discussed. First, we look at a method for cost-effective slack allocation [9], which focuses on how to allocate additional resources so that the system survives when single parts of it fail. The authors use slack to increase the system lifetime of NoC-based (Network-on-Chip) MPSoCs (MultiProcessor Systems-on-Chip). Slack means additional execution and storage resources that are not required in the
standard running state; when components fail, tasks and data of the failed components can be scheduled and mapped to these resources. In their Critical Quantity Slack Allocation (CQSA) technique, the authors try to find an optimal tradeoff between cost and lifetime improvement. The challenge in slack allocation is that the design space can be large and complex, i.e. there are many choices of where and how much slack to allocate. With CQSA it is possible to find designs within 1.4% of the lifetime-cost Pareto-optimal front while exploring only 1.4% of the design space.
After this system-level approach, which changes the hardware, the next two approaches are based on nature-inspired techniques: simulated annealing (SA) [7] and ant colony optimization (ACO) [6]. They target the allocation and scheduling of tasks so as to avoid overusing some resources while others are idle or at least less used. Overused resources age faster than others and, due to wearout, eventually fail earlier; they therefore become a reliability bottleneck that reduces system lifetime. The authors of [7] propose a lifetime reliability-aware task allocation for MPSoCs that uses simulated annealing. Their motivation is that wearout-related failures of components have to be considered during the task allocation and scheduling process, since the failure of important components reduces reliability and system lifetime. To compensate for this, a task allocation is developed that takes several wearout-related factors such as temperature, circuit structure, and voltage into account; the algorithm used for that task allocation is based on simulated annealing.
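As a bare-bones illustration of how simulated annealing can drive a task allocation, consider the sketch below. The cost function, cooling schedule, and all parameters are placeholders; the actual wearout model of [7] is far more detailed:

```python
import math
import random

def sa_allocate(tasks, cores, cost, steps=2000, t0=5.0, seed=1):
    """Simulated-annealing task allocation: start from a random
    task->core mapping and accept worsening moves with a probability
    that shrinks as the temperature cools. `cost` scores a mapping
    (lower is better, e.g. a predicted wearout penalty)."""
    rng = random.Random(seed)
    cur = {t: rng.choice(cores) for t in tasks}
    cur_cost = cost(cur)
    best, best_cost = dict(cur), cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9      # linear cooling
        cand = dict(cur)
        cand[rng.choice(tasks)] = rng.choice(cores)  # move one task
        c = cost(cand)
        # always accept improvements; accept regressions with prob e^(-d/T)
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = dict(cand), c
    return best
```

With a cost function that penalizes piling tasks onto one core (a crude stand-in for accelerated wearout of that core), the search converges to a balanced mapping.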
The third approach presented in this seminar paper also works on the task allocation to gain lifetime improvements. The authors of [6] propose a lifetime-aware task mapping technique based on nature-inspired ant colony optimization. They sought a method for improving system lifetime without having to invest in additional hardware, as slack allocation does. Their starting point was temperature-aware task mapping, but they concluded that considering temperature alone leads to high fluctuation in system lifetime. Therefore they considered further factors such as electromigration and time-dependent dielectric breakdown. In their ACO-based method, artificial ants explore a graph representation of a task mapping. The ants share information about good paths in the task graph, and based on that information, subsequent ants select paths that have previously proven to be good. The authors showed on a wide spectrum of benchmarks that their approach reaches a system mean time to failure within 17.9% of the observed optimum.
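The ant-and-pheromone mechanism can be sketched as follows. The ant count, evaporation rate, deposit rule, and cost function are invented placeholders; the real method of [6] scores mappings with a detailed wearout/MTTF model:

```python
import random

def aco_map(tasks, cores, cost, iters=50, evap=0.5, seed=1):
    """Tiny ant-colony sketch of task mapping: each ant builds a
    task->core assignment biased by pheromone, and the best mapping
    found so far deposits extra pheromone on its task->core choices.
    `cost` scores a full mapping (lower is better)."""
    rng = random.Random(seed)
    pher = {(t, c): 1.0 for t in tasks for c in cores}
    best, best_cost = None, float("inf")
    for _ in range(iters):
        for _ant in range(5):
            # each ant assigns every task, weighted by pheromone level
            m = {t: rng.choices(cores, [pher[(t, c)] for c in cores])[0]
                 for t in tasks}
            c = cost(m)
            if c < best_cost:
                best, best_cost = m, c
        for k in pher:                 # evaporation
            pher[k] *= evap
        for t, cr in best.items():     # reinforce the best-known path
            pher[(t, cr)] += 1.0 / (1.0 + best_cost)
    return best
```

Evaporation keeps old, poor choices from dominating, while reinforcement makes later ants favor assignments that scored well earlier, mirroring the path-sharing behavior described above.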
This seminar paper is organized as follows: Section 2 briefly introduces work related to the presented approaches. Section 3 describes the different methods for improving system lifetime in detail, with a focus on ACO-based task mapping. Section 4 then compares the methods with each other with respect to effectiveness and cost. The paper ends with a conclusion in Section 5.
2 Related Work

Two other approaches besides the one presented in this paper also use slack allocation to optimize cost and lifetime. The first minimizes area while selecting processing elements and then changes the processor selection to increase lifetime [10]. The other works similarly to the presented approach but does not use storage slack [5].
The meta-heuristic simulated annealing was first introduced in [8] and [2], where it was used to approximate solutions to the NP-complete traveling salesman problem. The task allocation problem has been shown to be NP-complete as well, so the authors of [7] adapted simulated annealing to it.
Ant colony optimization is also a meta-heuristic; it was first described in [4]. Prior to the work of [6], which is presented in this paper, ACO was used to solve task mapping problems in [1] and [3], although there, in contrast to [6], performance rather than system lifetime was optimized.
3 Lifetime Improvements in Embedded Systems

In this section the previously introduced approaches are described in detail, with a focus on ACO-based task mapping. To allow a comparison, the two other methods for lifetime improvement are presented first: we take a close look at the system-level approach of slack allocation before coming to the task allocations based on simulated annealing and ant colony optimization.
3.1 Lifetime Improvement by Slack Allocation

One way to increase the lifetime of embedded systems is to provide additional, not directly required resources, called slack, which compensate for failed components. Both data and tasks are remapped and rescheduled to these previously underused resources to avoid complete system failure. While this method gives the system a chance to survive the failure of single components, the drawback is that one has to invest in additional hardware. In a system as a whole there are many possibilities for where and how much slack should be allocated. The goal is to find a lifetime-cost Pareto-optimal front [9], i.e., a slack allocation that offers the best trade-off between lifetime and cost.

The authors of [9] focus on embedded network-on-chip-based multiprocessor systems-on-chip (NoC-based MPSoCs) and try to optimize system lifetime and system manufacturing cost by selecting where and how much slack to allocate. The challenge in finding an optimal slack allocation is that the number of possible allocations is exponential in the number of resources [9]. The authors developed a technique called Critical Quantity Slack Allocation (CQSA) to reach these goals.
The lifetime of embedded systems can be increased at the system level in three ways. First, execution slack can be allocated by replacing slow processors with faster processors. Second, storage slack can be allocated by replacing small memories with bigger memories. Third, the communication architecture can be changed: switches and links are added or modified, and additional processors and memories are put into the system. The task is then to determine how to increase lifetime cost-effectively. CQSA focuses on slack allocation and does not deal with changing the communication architecture.
3.1.1 General Working of CQSA

For CQSA to work, the following is assumed to be given. The computation, storage and communication requirements are known for each task that is executed. There is also a fixed communication architecture for a single-chip multiprocessor. Finally, an initial mapping of computational tasks to processors, storage tasks to memories and communication to links and switches is given [9]. With this, CQSA determines a slack allocation that optimizes both system lifetime and cost.

To survive a component failure, enough slack has to be allocated. The amount of slack needed to compensate for the failure of a component is defined as the critical quantity of slack for that component [9]. For a component C the critical quantity is written as a pair (es, ss), where es is the execution slack and ss the storage slack required to replace the resources of C; these resources would become unreachable in case of a failure. There is a distinction between processor, memory and switching components: processors only have critical quantities of execution slack (es, 0), and memories only have critical quantities of storage slack (0, ss), while switches can have both execution and storage slack.

The authors of CQSA state that it is most cost-effective to allocate slack around switches [9]. If slack is allocated to handle processor and memory failures, this allocation can, at no additional cost, also be used for the switch which interconnects the processors and memories. Allocating slack for switches partitions the design space, and because switches connect many components, the complexity of CQSA grows only slowly with an increasing number of overall components.
3.1.2 CQSA Algorithm

The CQSA algorithm consists of three stages. Stage 0 begins by allocating execution slack to overcome single component failures of processors. To achieve this, execution slack is greedily increased until the smallest execution-slack-only critical quantity (es, 0) is reached [9]; that is, the amount of slack can at least cover each single processor failure. Next, stage 1 also considers execution slack but now focuses on situations in which switches may fail. For switches that only need execution slack, additional slack is allocated; for that, each critical quantity (es, 0) with es > 0 is considered. In stage 2, storage slack is considered as well. This stage is executed for each critical quantity (es, ss) with es ≥ 0 and ss > 0. First, an exhaustive search is executed to find a slack allocation of (es, ss) that optimizes the mean time to failure (MTTF). This allocation probably does not lie on the Pareto-optimal front because it only considers MTTF and ignores cost. The MTTF-optimized allocation is used as an initial slack allocation which is compared with other allocations. The algorithm then executes a loop that computes two new allocations for comparison: in the first, execution slack is greedily increased (with regard to MTTF), and in the second, storage slack is greedily increased (also with regard to MTTF). Of the three allocations (the two new ones and the current one), the one with the best cost-MTTF trade-off is selected and used as the starting point for the next iteration of the loop (and in the comparison). This loop is repeated until no more allocations can be found.
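The stage-2 loop could be sketched as follows. This is a minimal Python illustration under our own assumptions, not the authors' implementation: the helpers `mttf`, `cost`, `grow_es` and `grow_ss`, as well as using the ratio MTTF/cost as the trade-off metric, are hypothetical choices made here for concreteness.

```python
def stage2_loop(initial, mttf, cost, grow_es, grow_ss):
    """Sketch of CQSA's stage-2 refinement (assumed interfaces).

    `grow_es`/`grow_ss` greedily add execution or storage slack and return
    a new allocation, or None when no further allocation exists.
    """
    explored = [initial]
    current = initial
    while True:
        # compute two new candidate allocations for comparison
        candidates = [c for c in (grow_es(current), grow_ss(current))
                      if c is not None]
        if not candidates:
            return explored               # no more allocations can be found
        # of the computed allocations, pick the best cost-MTTF trade-off
        nxt = max(candidates, key=lambda a: mttf(a) / cost(a))
        if mttf(nxt) / cost(nxt) <= mttf(current) / cost(current):
            return explored               # current allocation stays best
        current = nxt                     # starting point for next iteration
        explored.append(current)
```

A toy run with allocations modeled as (es, ss) pairs shows the loop walking along increasingly slack-rich allocations until the trade-off stops improving.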
3.1.3 Evaluation of CQSA

The authors used two setups to evaluate CQSA. In the first, smaller setup they performed an exhaustive search for the globally Pareto-optimal allocation of slack and compared it with the allocation found by CQSA. In the second setup they used a large benchmark to estimate how CQSA scales. In addition to the comparison with the Pareto-optimal allocation, three other slack allocation approaches were compared to CQSA: optimal execution slack allocation (Optimal ESA), greedy slack allocation (Greedy SA) and random slack allocation (Random SA). Optimal ESA finds a set of Pareto-optimal designs that only allocate execution slack. Greedy SA adds execution and storage slack greedily in iterations, where each iteration selects the allocation with the best cost-lifetime trade-off. Random SA chooses a random allocation from all possible allocations.

The authors observed that their approach is the most accurate in the first setup, where the optimal result found by exhaustive search was used as a reference: CQSA finds allocations within 1.81% of the optimum while exploring only 1.7% of the design space. The other approaches all produced worse results.

In the larger setup the authors used the best allocation found by any approach as the observed optimum (as exhaustive search is impractical due to the large size of the setup). In that benchmark CQSA again showed the best results. Another important observation was that the number of allocations CQSA evaluated grew only by a factor of 10 while the whole design space increased by a factor of 10^5.

To sum up, over all examples CQSA found slack allocations within 1.4% of the lifetime-cost Pareto-optimal front while exploring only 1.4% of the design space on average [9]. In the smaller benchmark CQSA was able to increase system lifetime by 22%. The authors, however, do not mention at what cost this lifetime improvement was achieved; only for one example run do they explicitly state that lifetime improved by 50% at a 62% cost increase. This also shows the big drawback of slack allocation: one has to invest a significant amount of money to increase system lifetime. The next two sections present methods that improve lifetime without additional investments in hardware.
Andre Koza
3.2 Simulated Annealing

In contrast to the previously introduced approach of increasing lifetime by slack allocation, this section presents a method that targets the task allocation and scheduling process for lifetime improvement. In [7] the authors state that if tasks are allocated in such a way that some processors are used more heavily than others, those processors will age faster and eventually fail earlier. If these processors are mandatory for the system, they become a reliability bottleneck and reduce overall system lifetime. To handle this, the authors developed a lifetime-reliability-aware task allocation and scheduling algorithm for MPSoCs, based on the nature-inspired technique of simulated annealing (SA).

Task allocations in prior work that seek to increase system lifetime focused mainly on reducing the system temperature, due to the strong relationship between temperature and lifetime [7]. It has been shown, however, that considering temperature alone does not substantially increase the lifetime of embedded systems [6]. Thus the authors propose to take other factors, such as internal structure, operating frequency or voltage, into account in a lifetime-aware task allocation. They investigated which errors can occur and how to increase the lifetime reliability of embedded systems, and came to the conclusion that avoiding permanent hard errors yields the best reliability and therefore the best lifetime improvement. The work focuses on time-dependent dielectric breakdown, electromigration and negative bias temperature instability; these failure mechanisms are used to estimate the MTTF of the systems.

The problem of allocating tasks to processors is NP-complete [7]. Thus, except for very small problems, exact approaches cannot be realized in an acceptable runtime. To overcome this, the authors developed a heuristic approach based on SA to solve the task scheduling problem.
3.2.1 Simulated Annealing Algorithm

Simulated annealing is a meta-heuristic for finding approximations to the global optimum of very large functions for which exhaustive search is infeasible in an appropriate runtime. To approximate an optimal solution with SA, a random initial solution is chosen at the beginning. In the case of task allocation, a random valid allocation of tasks to processors is chosen, where valid means that no precedence constraints or deadlines are violated. That solution is probably not the optimum. In the next step of the algorithm, a single random change to the task allocation is made. If the new allocation is better (i.e., closer to the optimum than the previous solution), it is always accepted. If, on the other hand, the solution becomes worse, it is only accepted with a certain probability. This probability is controlled by a variable called temperature: the higher the temperature, the higher the probability that a worse solution is accepted. This is done because otherwise the algorithm could get stuck in a local minimum. The temperature starts at a high value and decreases over time via a cooling rate until an end temperature is reached; at each temperature the algorithm makes a certain number of moves before the temperature is decreased. With lower temperature, the probability that worse solutions are accepted decreases. At the beginning of the algorithm the decision whether to accept a worse solution is nearly random; at the end this probability is very small and almost only improvements are accepted. If SA is run infinitely long, it will eventually output the optimal result. It has been shown that SA finds good approximations for the traveling salesman problem [2] [8], and in [7] it is adapted to find lifetime-aware task allocations.

Figure 1: Example of a simple task graph (taken from [7])

Figure 2: Example of task graph transformations: (a) expanded graph Ĝ, (b) complement graph G̃ (taken from [7])
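The annealing loop described in this section can be sketched as a generic Python skeleton. This is not the code of [7]: the solution representation, neighbor function and energy (cost) function are supplied by the caller, and only the default parameter values mirror those reported for the SA-based task allocation in Section 3.2.2 (start temperature 100, cooling rate 0.95, end temperature 10^-5, 1000 moves per temperature).

```python
import math
import random

def simulated_annealing(initial, neighbor, energy,
                        t_start=100.0, t_end=1e-5, cooling=0.95, moves=1000):
    """Generic SA skeleton (a sketch under assumed interfaces).

    `neighbor` returns a random single-change variant of a solution and
    `energy` is the cost to minimize (lower is better).
    """
    current = initial
    best = current
    t = t_start
    while t > t_end:
        for _ in range(moves):
            cand = neighbor(current)
            delta = energy(cand) - energy(current)
            # always accept improvements; accept worse solutions with a
            # probability exp(-delta/t) that shrinks as temperature drops
            if delta < 0 or random.random() < math.exp(-delta / t):
                current = cand
                if energy(current) < energy(best):
                    best = current
        t *= cooling                      # geometric cooling schedule
    return best
```

For a quick sanity check one can minimize a one-dimensional function such as (x - 3)^2 with small step perturbations as the neighbor function.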
3.2.2 SA-based Task Allocation

A task allocation is described using a directed acyclic task graph G = (V, E), where each node v ∈ V represents a task and each edge e ∈ E represents a precedence constraint. An illustration of a task graph can be found in Figure 1. A task allocation is then represented as (schedule order sequence; resource assignment sequence). An example is (0, 2, 1, 3, 4; P1, P1, P2, P1, P2): there are five tasks and two processors (P1 and P2); task 0 is scheduled first, followed by tasks 2, 1, 3 and 4; tasks 0, 2 and 3 are executed on processor P1, and tasks 1 and 4 on P2 [7].
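As a tiny illustration (the helper is ours, not from [7]): the two sequences pair up positionally, so the i-th entry of the resource assignment sequence names the processor of the i-th scheduled task.

```python
def tasks_per_processor(schedule_order, assignment):
    """Decode a (schedule order; resource assignment) pair by grouping
    tasks under the processor they are assigned to."""
    mapping = {}
    for task, proc in zip(schedule_order, assignment):
        mapping.setdefault(proc, []).append(task)
    return mapping
```

Applied to the example above, this yields tasks 0, 2 and 3 on P1 and tasks 1 and 4 on P2, matching the text.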
To find new solutions from a random initial solution within the simulated annealing process, graph transformations are executed. First, there is an expanded task graph Ĝ = (V, Ê). This graph has the same nodes as G but additional edges: if there is a (possibly transitive) precedence constraint between two nodes of G, a directed edge between these two nodes is added in Ĝ. In the graph from Figure 1, an edge would be added from node 2 to node 4. An illustration of the Ĝ resulting from G is given in Figure 2(a). Next, another graph is created: an undirected complement graph G̃ = (V, Ẽ). In this graph there is an undirected edge (vi, vj) in Ẽ if and only if there is no precedence constraint between vi and vj in Ĝ [7]. An illustration is shown in Figure 2(b).

The authors define a valid schedule order as an order of tasks that conforms to the partial order defined by the task graph G. Furthermore, they formulate a lemma as follows: "Given a valid schedule order A = (a1, a2, ..., a|V|), swapping adjacent nodes leads to another valid schedule order, provided there is an edge between those two nodes in graph G̃" [7]. Next, they state a theorem: "Starting from a valid schedule order A = (a1, a2, ..., a|V|), we are able to reach any other valid schedule order B = (b1, b2, ..., b|V|) after finite times of adjacent swapping" [7]. Then, to reach all possible solutions, three kinds of moves are used in the algorithm: "M1: Swap two adjacent nodes in both schedule order sequence and resource assignment sequence, if there is an edge between these two nodes in graph G̃. M2: Swap two adjacent nodes in resource assignment sequence. M3: Change the resource assignment of a task" [7].
With these definitions and the introduced moves, all possible task allocations can be reached: with M1, all other valid schedules can be reached, and with M2 and M3 all resource assignments can be chosen. The authors set the start temperature for simulated annealing to 100, the cooling rate to 0.95 and the end temperature to 10^-5. At each temperature, 1000 random moves are executed before the temperature is reduced. A found solution counts as an improvement if the MTTF of the system increases. For that, a cost function is introduced which reflects whether a solution is valid and computes the MTTF according to the failure mechanisms mentioned above.
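The three moves could be implemented as follows. This is a sketch under our own assumptions, not the authors' code: the allocation representation is the (schedule order; resource assignment) pair from above, and the complement graph G̃ is assumed to be given as a set of edge tuples.

```python
import random

def move(order, assign, complement_edges, processors):
    """Apply one random move (M1, M2 or M3 from [7]) to an allocation;
    data layout and helper names are our assumptions."""
    order, assign = list(order), list(assign)
    kind = random.choice(("M1", "M2", "M3"))
    i = random.randrange(len(order) - 1)          # position of an adjacent pair
    if kind == "M1":
        # M1: swap adjacent tasks in BOTH sequences, only if the tasks are
        # unordered, i.e. connected by an edge in the complement graph G̃
        if (order[i], order[i + 1]) in complement_edges or \
           (order[i + 1], order[i]) in complement_edges:
            order[i], order[i + 1] = order[i + 1], order[i]
            assign[i], assign[i + 1] = assign[i + 1], assign[i]
    elif kind == "M2":
        # M2: swap adjacent entries in the resource assignment sequence only
        assign[i], assign[i + 1] = assign[i + 1], assign[i]
    else:
        # M3: reassign one task to a (possibly different) processor
        assign[random.randrange(len(assign))] = random.choice(processors)
    return tuple(order), tuple(assign)
```

Because M1 checks G̃ before swapping, every move preserves validity of the schedule order per the lemma quoted above.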
3.2.3 Benchmarks of SA-based Task Allocation

To test the lifetime improvements, the authors generated random task graphs with 20 to 260 tasks and tested them on different hypothetical MPSoC platforms with 2 to 8 processor cores. They benchmarked the SA-based task allocation against a temperature-aware task scheduling algorithm based on list scheduling, and showed that their approach achieves longer system lifetimes than temperature-aware task mappings. Depending on how many processors are used and how many tasks have to be mapped, SA showed improvements from 0% to 81.81%; the more tasks have to be mapped and the more processor cores are used, the better the improvement of SA gets.

All in all, the simulated-annealing-based task allocation improves system lifetime compared to a task allocation that only regards temperature. The authors, however, did not compare their approach to other lifetime-aware task mappings, and there is no benchmark showing the lifetime increase compared to a random task mapping that ignores lifetime. Compared to slack allocation, this method requires no further investment in additional hardware.
3.3 ACO-based Task Mapping

This section presents a method for increasing lifetime in embedded systems that focuses on task mappings. The authors of [6] have developed a lifetime-aware task mapping technique based on ant colony optimization (ACO). In contrast to approaches like slack allocation, the authors wanted a method that does not increase system cost.

Other approaches that seek to increase system lifetime through task mapping focused on mappings that optimize system temperature. It has been shown that there is a strong relationship between system temperature and system lifetime, so reducing temperature can result in a better lifetime [6]. However, the authors observed a high fluctuation in lifetime when only temperature is considered. They therefore concluded that additional factors influencing the task mapping have to be considered when lifetime optimization is the goal.

In general, finding an optimal task mapping is an NP-complete problem [1]. To handle this, a heuristic approach is needed that finds a solution close to the optimum. The authors therefore developed a task mapping based on ant colony optimization. They chose ACO because task mappings have been solved effectively with ACO in the past, and because it remains usable in a changing environment (failure of components).
3.3.1 Problem Definition

The authors developed a lifetime-aware task mapping. In their approach a task mapping is application-dependent and defined as the assignment of tasks to processors and of data arrays to memories [6]. The general goal of task mapping is to optimize one or more objectives; here the goal is to optimize system lifetime, for which several objectives have to be considered.

Because of the strong dependence of component lifetime on component temperature, minimizing system temperature is one factor to be considered. For that, either the peak system temperature Tmax or the average system temperature Tavg is minimized. Furthermore, it is not enough to minimize the overall temperature; component temperatures are also important. For example, even if the overall temperature is low, the system fails if one essential component experiences high temperature and fails early.

Regarding only temperature ignores other physical factors that can influence system lifetime. To overcome this, the authors of [6] additionally consider electromigration, time-dependent dielectric breakdown and thermal cycling. These three factors influence the system MTTF and cause what are called wearout-related permanent faults.

With the use of temperature and these physical parameters to address component failure, a lifetime-aware task mapping is designed. The task mapping is based on ACO, which is described next.
3.3.2 Ant Colony Optimization

Ant colony optimization (ACO) is a nature-inspired approach in which artificial ants explore paths in the solution space of a problem, leaving pheromone trails in which information about the quality of a path is stored [1]. Nature-inspired means that natural processes are imitated.

ACO imitates the indirect communication of ants as they explore new food sources: an ant swarm can find shortest paths between food sources and the nest by this indirect communication [1]. When ants move out, they emit a chemical substance called pheromone. The amount of pheromone on a trail increases the more ants take the same path. Following ants can detect the pheromone, and the higher the pheromone concentration on a path, the higher the probability that an ant will take that path. To avoid early convergence to a particular path during the exploration process, the pheromone evaporates over time; through evaporation, paths that are of no further use are eventually ignored as all the pheromone on them fades away [1].

This natural behavior is adapted in an artificial way to optimize a constructive search process for combinatorial problems [1]. Artificial ants explore a search space, and when they take a path that leads to a good solution they leave an artificial pheromone trail on it, so that following ants will take that path with a higher probability than other paths.
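The two ACO primitives just described, pheromone-weighted path choice and evaporation, can be sketched generically. This is our own illustration, not code from [1] or [6]; the pheromone store is assumed to be a plain dictionary from options to pheromone levels.

```python
import random

def choose(options, pheromone):
    """Pick an option with probability proportional to its pheromone level
    (roulette-wheel selection)."""
    total = sum(pheromone[o] for o in options)
    r = random.uniform(0, total)
    acc = 0.0
    for o in options:
        acc += pheromone[o]
        if r <= acc:
            return o
    return options[-1]          # numeric fallback

def evaporate(pheromone, rate=0.1):
    """Let a fraction of every trail's pheromone fade away, so unused
    paths are eventually ignored."""
    for o in pheromone:
        pheromone[o] *= (1.0 - rate)
```

With a heavily reinforced option, `choose` still occasionally picks the weaker one, which is exactly the exploration behavior the evaporation mechanism relies on.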
3.3.3 Task Mapping

To adapt this to the task mapping problem, the authors of [6] developed an approach based on ACO. In the following, some basics needed for the method are introduced. First, the task mapping requires a system description, consisting of a list of components including their capacities and the links between them [6]. Second, a task graph is needed, consisting of a list of tasks including their requirements and communication rates. The authors then define their goal as follows: "Our goal is to determine the initial mapping of tasks to processors and data arrays to memories which results in the longest system lifetime" [6]. They only define an initial task mapping and do not address efficient remapping of tasks in their paper.

The ACO strategy is implemented via a construction graph (see Figure 3). This graph consists of nodes and directed edges. The set of nodes contains all system components and all tasks of the application. There are two types of edges: decision edges connect components to tasks, and mapping edges connect tasks to components.

The graph is traversed by artificial ants. At the beginning, a decision edge is chosen which ends in a task; this task is the first to be executed. Next, the ant chooses a mapping edge that connects the task to a component, completing a single task-to-component mapping. After that, another decision edge is taken. This process is repeated until all tasks are mapped to components. An illustration of the process is given in Figure 3: there is a task graph containing all tasks, and a communication architecture containing all components, which are connected via a switch. The colors indicate associated tasks and components. At the beginning of the mapping, an ant starts at node T1 and chooses one of the four mapping edges. In this case, the ant selects the edge that ends in node C2, meaning that task T1 is executed on component C2. After that, again a task is chosen, until all tasks are mapped to components.

The ants choose edges by weighted random selection, where the weight of an edge depends on the amount of pheromone on it. Thus ants take paths that have been shown to be part of a good solution in the past with a higher probability than other paths, while the procedure still allows other paths to be selected in order to search for new solutions that might be better than older ones. The evaporation of pheromone prevents the algorithm from getting stuck in a local minimum.
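One ant's traversal of the construction graph could be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation: decision edges are keyed as `("dec", task)` instead of originating from the last-mapped component, pheromone is a dictionary with a default level of 1.0, and all names are ours.

```python
import random

def ant_walk(tasks, components, pheromone):
    """One ant alternates between a decision edge (pick the next unmapped
    task) and a mapping edge (pick a component for it), both chosen by
    pheromone-weighted random selection. Returns a task -> component map."""
    def weighted_pick(edges):
        total = sum(pheromone.get(e, 1.0) for e in edges)
        r = random.uniform(0, total)
        acc = 0.0
        for e in edges:
            acc += pheromone.get(e, 1.0)
            if r <= acc:
                return e
        return edges[-1]

    unmapped = list(tasks)
    mapping = {}
    while unmapped:
        _, task = weighted_pick([("dec", t) for t in unmapped])   # decision edge
        _, comp = weighted_pick([(task, c) for c in components])  # mapping edge
        mapping[task] = comp
        unmapped.remove(task)
    return mapping
```

Feeding pheromone onto good edges (e.g. a high level on the T1-to-C2 mapping edge) biases later ants toward repeating that assignment, as described above.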
Figure 3: Task mapping process after completion (taken from [6])

Figure 4: Overview of the ACO-based task mapping (taken from [6])

After an ant has traversed the construction graph, the found solution is checked for validity and given a score; details about the validity check and the scoring follow in the next paragraphs. An illustration of the whole task mapping process can be found in Figure 4. Beginning with an ant traversing the construction graph, a mapping is found; this phase is called task mapping synthesis. After that, the task mapping is checked for validity. If the solution is valid, the lifetime of the mapping is evaluated, which results in a system MTTF. The task mapping is then given a score that depends on its validity and the MTTF: invalid mappings get a bad score, while valid mappings get a score that reflects the MTTF. Only if the found solution has the best score so far is the construction graph fed with pheromone.
3.3.4 Task Mapping Evaluation<br />
After an ant has traversed the construction graph, the resulting task mapping must be evaluated. First, it is checked whether the mapping is valid. Valid means, on the one hand, that no component capacities have been violated. Component capacities are given in MIPS (million instructions per second) for processors and in KB (kilobytes) for memories. The processing requirements of compute tasks and the storage requirements of data tasks are determined, and possible violations are identified. On the other hand, the communication traffic between tasks is checked to determine whether any bandwidth capacities have been violated. If neither component capacities nor bandwidth capacities are violated, the solution is valid.
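The two-part validity check can be sketched as follows. All names (`mapping_is_valid`, the dictionary layouts) are our own illustration of the constraints just described, not an interface from [6].

```python
def mapping_is_valid(mapping, tasks, components, traffic, links):
    """Check a candidate task mapping against capacity constraints.

    mapping:    task -> component it is placed on
    tasks:      task -> required capacity (MIPS for compute tasks,
                KB for data tasks)
    components: component -> available capacity (MIPS or KB)
    traffic:    (task, task) -> required bandwidth
    links:      (component, component) -> available link bandwidth
    """
    # Part 1: component capacities. The summed requirements of the
    # tasks placed on a component must not exceed what that processor
    # (MIPS) or memory (KB) offers.
    load = {}
    for task, comp in mapping.items():
        load[comp] = load.get(comp, 0) + tasks[task]
    if any(load[c] > components[c] for c in load):
        return False
    # Part 2: bandwidth capacities. Traffic between tasks placed on
    # different components must fit on the link connecting them.
    use = {}
    for (a, b), bw in traffic.items():
        ca, cb = mapping[a], mapping[b]
        if ca != cb:
            link = (ca, cb) if (ca, cb) in links else (cb, ca)
            use[link] = use.get(link, 0) + bw
    return all(use[l] <= links[l] for l in use)
```

A mapping that overloads either a component or a link is rejected before any lifetime evaluation takes place.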
André Koza
To determine the MTTF of a valid solution, the authors of [6] use a system lifetime model, described as follows. The system lifetime resulting from a task mapping is defined as the amount of time between powering up a system and the failure of the system, i.e., the point at which its performance constraints can no longer be satisfied [6]. The performance constraints can be fulfilled as long as valid task re-mappings exist.
The physical factors listed in Section 3.3.1 are used to estimate permanent component failures due to wearout. The authors used a lognormal failure distribution for each of the factors and normalized them so that the MTTF is 30 years at the characterization temperature of 345 K [6].
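The normalization step has a simple closed form: a lognormal distribution with parameters mu and sigma has mean exp(mu + sigma^2/2), so mu can be solved for directly. The sketch below assumes a shape parameter sigma = 0.5 purely for illustration; [6] does not state the value used.

```python
import math

def lognormal_mu(target_mttf_years, sigma):
    # A lognormal distribution has mean exp(mu + sigma**2 / 2).
    # Solving for mu pins the mean failure time (the MTTF) at the
    # characterization temperature of 345 K to the target value.
    return math.log(target_mttf_years) - sigma ** 2 / 2.0

# Hypothetical shape parameter; only mu is fixed by the normalization.
mu = lognormal_mu(30.0, sigma=0.5)
recovered_mean = math.exp(mu + 0.5 ** 2 / 2.0)  # equals 30.0 years
```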
In the next step, component temperatures have to be determined in order to obtain component MTTFs and the resulting system MTTF. The temperature of a component depends on its utilization and power dissipation. The utilization of a component is determined from the task mapping and the system description (a list of components including their capacities and the links between them, see above). From this data the component power dissipation can be derived, which leads to a temperature for each component. The temperature can then be used to determine the component MTTF based on the above-mentioned normalized MTTF of 30 years at 345 K. As this seminar paper focuses on lifetime improvement through the ACO technique, details on how component power dissipation and the resulting temperatures are determined are omitted.
Overall system MTTF is then determined by an iterative simulation. In each iteration, failure times of components are randomly selected based on the task mapping, component utilization, and temperature. This means that not the MTTF of a component is chosen, but one concrete failure time. When a component fails, the remaining tasks and data are remapped; this remapping process is not lifetime-aware. It is then checked whether the remapping still satisfies the system's performance constraints. If it does, component utilization and temperature are recalculated based on the remapping, and the resulting data is used in the next iteration. This is repeated until the system fails. The process is executed for several sample systems, and the system MTTF is finally determined as the mean of all sample failure times.
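The Monte Carlo structure of this simulation can be sketched as follows. This is a simplified model under our own assumptions: `draw_failure_time` stands in for the utilization- and temperature-dependent failure-time sampling, and `remap` stands in for the (not lifetime-aware) remapping step returning `None` when no valid mapping remains.

```python
import random

def simulate_system_mttf(components, draw_failure_time, remap,
                         samples=200, seed=1):
    # Each sample plays one system life: components fail one after
    # another at randomly drawn concrete failure times (not their
    # MTTFs), tasks are remapped after every failure, and the system
    # dies once no valid remapping exists. System MTTF is the mean
    # of all sample failure times.
    rng = random.Random(seed)
    lifetimes = []
    for _ in range(samples):
        alive = set(components)
        t = 0.0
        while True:
            # Draw one concrete failure time per live component; in the
            # full model the distribution depends on utilization and
            # temperature under the current mapping.
            times = {c: t + draw_failure_time(c, rng) for c in alive}
            victim = min(times, key=times.get)
            t = times[victim]
            alive.discard(victim)
            if remap(alive) is None:  # performance constraints violated
                break
        lifetimes.append(t)
    return sum(lifetimes) / len(lifetimes)
```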
The system MTTF is used to score a task mapping solution. The score equals the ratio of the MTTF to a baseline MTTF for that system. The authors, however, do not explain how to obtain these baseline MTTFs; they only state that they are obtained from hand-crafted task mappings for example systems. Invalid solutions are scored so that they are never chosen over a valid solution.
The score determines the amount of pheromone placed on the edges of the construction graph: the amount deposited equals the score, and pheromones are only deposited on the path if the score of the solution is the highest found so far. To simulate the evaporation of pheromones over time, each time an ant has explored a new task mapping and its score has been computed, the pheromones on the edges (i.e., the edge weights) experience decay [6]. That means each weight changes by a certain percentage that depends on the number of valid task mappings.
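The deposit-only-on-best rule combined with per-evaluation decay can be sketched like this. For simplicity the sketch uses a fixed decay rate, whereas [6] makes the rate depend on the number of valid task mappings; the function name and signature are our own.

```python
def update_pheromones(pheromone, path, score, best_score, decay=0.05):
    # Decay runs every time a new task mapping has been scored,
    # simulating evaporation of the trails over time.
    for edge in pheromone:
        pheromone[edge] *= (1.0 - decay)
    # Pheromone equal to the score is deposited on the solution's path
    # only if this solution beats the best score found so far.
    if score > best_score:
        for edge in path:
            pheromone[edge] += score
        best_score = score
    return best_score
```

Because only best-so-far solutions deposit, mediocre mappings still erode old trails through decay without reinforcing their own path.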
3.3.5 Benchmarks
The ACO-based task mapping and a simulated annealing based task mapping are benchmarked in order to compare the resulting system MTTFs. As benchmark applications, the authors of [6] used a synthetic application (synth), a Multi-Window Display (MWD), and an MPEG-4 Core Profile Level 1 decoder (CPL1).
For the benchmarks, two variants of the ACO-based approach and two variants of a simulated annealing (SA) based approach were used. SA was chosen for comparison because it here represents a temperature-aware task mapping approach. It is important to note that this is not the SA approach described in Section 3.2.
The first variant of the ACO-based approach, called agnosticAnts, simulates a random selection of a task mapping: a single valid task mapping is generated before the search is stopped. Because no pheromone trails have been laid out at the beginning, all possible solutions are equally likely to be chosen first.
The second approach used in the benchmarks is lifetimeAnts. Here, a lifetime-aware task mapping is executed, and the ants explore 20 valid task mappings before the search is stopped. The task mapping with the highest MTTF is chosen as the result. The authors chose the value 20 because, according to experiments with higher and lower numbers, it offers a good tradeoff between MTTF and runtime.
In addition, two variants of SA were used in the benchmarks. The first one, called avgSA, finds task mappings with an optimized average initial component temperature. The second one, maxSA, emphasizes the optimization of the maximum initial component temperature. The SA-based approaches were stopped once they had reached 50 valid task mappings.
The authors used different design points for each benchmark. A design point is a communication architecture consisting of different processors and memories interconnected via switches. Different design points can share the same communication architecture but differ in the types of processors and/or memories. Additionally, the authors introduced different amounts of slack in each design point according to the method presented in [9].
The following paragraphs present the benchmark results. First, the synthetic application is evaluated. The authors designed this application small enough that an exhaustive search of all possible valid task mappings is practicable. They compared the best MTTF found by the exhaustive search with that found by lifetimeAnts and observed that lifetimeAnts was able to create task mappings with an equivalent MTTF. For this benchmark they used 16 different design points.
After that, the authors executed so-called real-world benchmarks with MWD and CPL1. In these benchmarks, optimal results could not be obtained due to the large number of possible valid task mappings; even in the smallest of the real-world benchmarks, the authors counted 1.224e10 possible valid task mappings. To overcome this, they executed all four approaches several hundred times to obtain an observed optimal task mapping, which acted as a reference. This observed optimal task mapping is the one with the highest system MTTF found across all runs of all approaches.
The results of the real-world benchmarks are shown in Table 1.

Benchmark   agnosticAnts   lifetimeAnts   avgSA    maxSA
MWD 4-s     65.6%          77.3%          83.4%    82.4%
CPL1 4-s    61.4%          83.9%          81.8%    81.8%
CPL1 5-s    64.0%          85.1%          84.3%    83.1%

Table 1: Benchmark of task mapping approaches as a percentage of the observed optimal results. Taken from [6].

            Max. Initial Temp.    Avg. Initial Temp.
Benchmark   Avg       Max         Avg       Max
MWD 4-s     27.4%     44.3%       32.3%     47.9%
CPL1 4-s    17.5%     24.5%       33.5%     53.2%
CPL1 5-s    15.3%     23.2%       31.9%     101.7%

Table 2: Lifetime ranges of task mappings whose initial component temperature lies within 1% of the observed optimum. Lifetimes vary greatly even within this small temperature interval. Taken from [6].

The percentages in the columns of Table 1 show the fraction of the observed optimal lifetime; for example, lifetimeAnts reached 83.9% of the observed optimal lifetime in the benchmark CPL1 4-s (4 switches). These percentages are averaged across all design points used in a benchmark. The benchmark shows that lifetimeAnts outperformed agnosticAnts in all test cases, while the results of avgSA and maxSA are nearly the same as those of lifetimeAnts.
The authors performed another evaluation of their benchmarks, in which they compared the lifetime ranges of task mappings with temperatures within 1% of the observed optimum temperature. The results can be found in Table 2. The first column gives the benchmark application. The second column, labeled Max. Initial Temp., shows the lifetime ranges of all approaches within 1% of the observed optimal maximum initial component temperature; both the average and the maximum range are shown. The third column shows the same for the observed optimal average initial component temperature. For example, the maximum lifetime range across all task mappings whose maximum initial component temperature lies within 1% of the lowest is 44.3% for MWD 4-s. From this table the following conclusion can be drawn: task mappings that result in a low system temperature are often not optimized for lifetime. On the other hand, the authors observed in their benchmarks that task mappings resulting in a high system lifetime also result in a low temperature. They therefore concluded that temperature-aware task mapping is a subset of lifetime-aware task mapping: temperature-aware approaches only find task mappings that are optimized for temperature but not necessarily for lifetime, while lifetime-aware task mappings show good results in both lifetime and temperature.
To sum up, ACO-based task mapping showed an improvement of 32.3% in lifetime compared to a random task mapping approach [6], achieved with no additional investment in hardware. The authors focused their work on the comparison between lifetime-aware and temperature-aware task mappings and concluded that when only temperature is considered, system lifetime fluctuates strongly.
4 Comparison
In this section, the previously presented approaches to increasing lifetime in embedded systems are compared to each other. The first approach we looked at was slack allocation, in which additional resources are brought into the system to cover potential future failures of components. In case of a failure, tasks and data of failed components are remapped to the slack resources. The goal of this approach is to find lifetime-cost Pareto-optimal slack allocations. The CQSA approach found slack allocations within 1.4% of the Pareto optimum while exploring only 1.4% of the design space on average. Lifetime could be increased by 22% in a small benchmark. No data is provided for real-world benchmarks as in Section 3.3.
The next two approaches we presented focus on task mappings to increase lifetime. Both adapted nature-inspired methods to the task mapping problem, and both considered not only system temperature but additional physical failure mechanisms. The simulated annealing technique provides task mappings that showed lifetime improvements compared to a temperature-aware method. The results of this method vary from 0% (in only one benchmark) up to 81.81%, depending on how many tasks had to be mapped and how many processor cores were used.
The third approach presented in this seminar paper, and the focus of this work, was ACO-based task mapping. The authors adapted the behavior of an ant swarm searching for new food sources to the task mapping problem. In benchmarks, the ACO-based task mapping was compared to a random approach and to two SA-based temperature-aware approaches: one targeting the average temperature and one targeting the maximum temperature. The ACO-based task mapping showed the best lifetime improvements on average with the lowest runtimes, reaching a 32.3% longer lifetime than a random task mapping approach.
All three examined approaches showed lifetime improvements. The advantage of the task mapping approaches is that no additional investments in hardware have to be made. The authors of [9] do not clearly state how much must be invested to achieve a certain lifetime improvement; in the only example run they mention, they obtained 50% more lifetime at a cost increase of 62%. Compared to the task mapping approaches, whose additional hardware cost is 0%, this is substantial. A common benchmark would be needed for a meaningful comparison of the SA-based and the ACO-based approach. Both approaches were benchmarked against temperature-aware task mappings, but the two benchmarks differ considerably: for example, the ACO-based approach is compared to two different temperature-aware approaches while the SA-based approach is compared to only one, and the ACO-based approach additionally used slack according to the method in [9]. From the available data, no conclusion can be drawn as to which approach increases system lifetime the most.
5 Conclusion
In this seminar paper we discussed three approaches to improving system lifetime in embedded systems. On the one hand, there was a system-level approach that improves lifetime by providing additional resources; on the other hand, there were two approaches that change the way resources are utilized within the system. Due to the lack of comparable benchmarks, it cannot be said which approach yields the best lifetime improvements. The advantage of task mapping over slack allocation is that there is no additional cost for new hardware. As proposed in [6], a combination of slack allocation and lifetime-aware task mapping is promising: the system benefits from both approaches, and the system designer can decide how much to invest in slack to obtain an increase in lifetime.
References
[1] Markus Bank and Udo Honig. An ACO-based approach for scheduling task graphs with communication costs. In Proceedings of the 2005 International Conference on Parallel Processing, pages 623–629, Washington, DC, USA, 2005. IEEE Computer Society.

[2] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985. doi:10.1007/BF00940812.

[3] C.-W. Chiang, Y.-C. Lee, C.-N. Lee, and T.-Y. Chou. Ant colony optimisation for task matching and scheduling. Computers and Digital Techniques, IEE Proceedings, 153(6):373–380, Nov. 2006.

[4] M. Dorigo, V. Maniezzo, and A. Colorni. Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 26(1):29–41, Feb. 1996.

[5] M. Glass, M. Lukasiewycz, F. Reimann, C. Haubelt, and J. Teich. Symbolic reliability analysis and optimization of ECU networks. In Design, Automation and Test in Europe, DATE '08, pages 158–163, March 2008.

[6] Adam S. Hartman, Donald E. Thomas, and Brett H. Meyer. A case for lifetime-aware task mapping in embedded chip multiprocessors. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES/ISSS '10, pages 145–154, New York, NY, USA, 2010. ACM.

[7] Lin Huang, Feng Yuan, and Qiang Xu. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '09, pages 51–56, Leuven, Belgium, 2009. European Design and Automation Association.

[8] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[9] Brett H. Meyer, Adam S. Hartman, and Donald E. Thomas. Cost-effective slack allocation for lifetime improvement in NoC-based MPSoCs. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '10, pages 1596–1601, Leuven, Belgium, 2010. European Design and Automation Association.

[10] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, and Li Shang. Reliable multiprocessor system-on-chip synthesis. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '07, pages 239–244, New York, NY, USA, 2007. ACM.
Warp processing
Maryam Sanati
University of Paderborn
msanati@mail.uni-paderborn.de
November 2011
Abstract
This paper presents a framework for the dynamic synthesis of thread accelerators, or thread warping. Warp processing is the process of dynamically converting a typical software instruction binary into an FPGA circuit binary for speedup. FPGAs can be much faster than microprocessors: while a microprocessor may execute several operations in parallel, an FPGA can implement thousands of operations in parallel. Warp processing uses an on-chip processor to remap critical code regions from processor instructions to an FPGA circuit using runtime synthesis. Basic warp processing provides dynamic synthesis for a single-process, single-thread system. Performance can be improved further by thread warping, which can adapt the system to changing thread behavior and different mixes of resident applications.
1 Introduction
This section describes a new processing architecture known as the warp processor. Warp processing gives a computer chip the ability to improve its own performance: a program runs on a microprocessor chip, and the chip tries to detect the most frequently executed parts of the program. It then moves these parts to a field-programmable gate array (FPGA). An FPGA has the ability to execute some, but not all, programs 10, 100, or even 1,000 times faster than a microprocessor. If the microprocessor determines that the FPGA is faster for a particular part of the program, it causes the program execution to "warp": the microprocessor moves the selected part to the FPGA.
While some applications see no speedup on FPGAs, highly parallelizable applications such as image processing, encryption, encoding, video/audio processing, and mathematical simulations may achieve 2x, 10x, 100x, or even 1000x speedups compared to fast microprocessors. Consumers who enhance their photos using Photoshop or edit videos on their PCs would find their systems sped up by warp processing. Because optimization happens at runtime, warp processing may also eliminate the tool flow restrictions and extra designer effort associated with traditional compile-time optimizations.
2 Warp processing
A warp processor dynamically detects the critical regions of a binary and reimplements them, which results in 2x to 100x speedups compared to execution on a microprocessor. In general, software bits are downloaded into a hardware device. In a traditional microprocessor, these bits represent sequential instructions to be executed by the programmable microprocessor. In an FPGA, the software bits describe a circuit to be mapped onto the FPGA's configurable logic fabric. In both cases, developers download the software bits to a prefabricated hardware device to implement their desired computation; for both kinds of software, no hardware has to be designed.
A computation might execute faster as a circuit on an FPGA than as sequential instructions on a microprocessor because a circuit allows concurrency from the bit level to the process level [1]. The most difficult part of warp processing is dynamically reimplementing code regions on an FPGA, which involves many steps, such as decompilation, partitioning, synthesis, and placement and routing, and needs special tools for these stages in order to minimize computation time and data memory compared to the main processor.
From an electrical point of view, programming an FPGA is the same as programming a microprocessor. Many research tools aim to compile popular high-level programming languages such as C, C++, and Java to FPGAs. Many of these compilers use profiling to detect the kernels of a program, i.e., its most frequently executed parts, map those parts to a circuit on an FPGA, and let the microprocessor execute the rest of the program.
Recent studies showed that designers can perform hardware/software partitioning starting from binaries rather than from high-level code by using decompilation. In other words, warp processing is a process in which an executing binary is dynamically and transparently optimized by moving parts of it to on-chip configurable logic.
2.1 Components of a warp processor
Figure 1 provides an overview of a warp processor. The warp processor consists of a microprocessor, which is the main processor, and a warp-oriented FPGA (W-FPGA) sharing instruction and data caches or memory, an on-chip profiler, and an on-chip computer-aided design module (dynamic CAD tools). Initially, a developer or end user downloads a program, and it executes only on the main processor. During the execution of the application, the profiler monitors the execution and dynamically detects the critical kernels. After the binary's kernels have been detected, the dynamic CAD tools map those critical regions to an FPGA circuit. The binary updater then updates the program binary to use the new circuit. Once the update has taken place, the execution warps: the program's execution speeds up by a factor of two, 10, or even more.
Figure 1: Warp processor architecture/overview
As we mentioned before pr<strong>of</strong>iler is in charge <strong>of</strong> monitoring application’s behavior to<br />
determine <strong>the</strong> critical kernels, which can be implemented as hardware by warp processor.<br />
Branch frequencies are stored in a cache that the profiler updates whenever a backward branch occurs. In this way, the profiler can determine the critical kernels accurately. After profiling has detected the critical regions, the on-chip CAD module executes the partitioning, synthesis, mapping, and routing algorithms. The dynamic CAD first analyzes the profiling results to decide which critical kernels should be implemented in hardware. After selecting the binary kernels, the CAD tool decompiles the critical regions into a control/data flow graph and synthesizes the critical kernels to produce an optimized hardware circuit, which is later mapped onto the W-FPGA using mapping, placement, and routing technology. Warp processors synthesize circuits from the executing binary code rather than from source code. Because binary code lacks high-level constructs such as loops, arrays, and functions, synthesizing from it might produce slower or bigger circuits. Alternatively, the on-chip CAD tools can be replaced by a software task on the main processor; this software task then shares computation and memory resources with the main application. We can also build a multiprocessor system with multiple warp processors on a single device. In that case we do not need multiple on-chip CAD modules: a single one is sufficient, supporting each of the processors in a round-robin fashion [2]. Here, too, the CAD can be executed as a software task instead of being implemented in hardware.
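The backward-branch profiling idea can be sketched in software; the trace format, addresses, and top-k policy below are invented for illustration and are not the authors' hardware profiler design:

```python
# Sketch of backward-branch profiling: a small table of branch frequencies,
# updated only on backward branches (loop back-edges), whose hottest entries
# identify the critical loops (kernels). All addresses are hypothetical.

from collections import Counter

def profile(trace):
    """trace: iterable of (pc, target) pairs for executed branches.
    A backward branch (target <= pc) closes a loop iteration, so its
    frequency approximates how hot that loop is."""
    freq = Counter()
    for pc, target in trace:
        if target <= pc:          # backward branch -> loop back-edge
            freq[target] += 1
    return freq

def critical_kernels(freq, k=2):
    # The k most frequent back-edge targets are the candidate kernels.
    return [addr for addr, _ in freq.most_common(k)]

# Hypothetical branch trace: a hot loop at 0x100, a cold one at 0x200,
# and one forward branch that the profiler ignores.
trace = [(0x120, 0x100)] * 1000 + [(0x210, 0x200)] * 3 + [(0x130, 0x300)]
freq = profile(trace)
print(critical_kernels(freq, k=1))  # [256] i.e. the loop at 0x100
```

Counting only back-edges keeps the table small, which matches the constraint that the on-chip profiler must use minimal resources.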
Researchers have developed many decompilation techniques to recover high-level constructs such as loops, arrays, and functions. Two efficient techniques are:
Maryam Sanati
Figure 2: Dynamic CAD tools
-Loop rolling
-Operator strength promotion
Loop rolling detects an unrolled loop in a binary and replaces the code with a rerolled loop, thus letting a circuit synthesizer unroll the loop by an amount that matches the available FPGA resources. Previous decompilation techniques also use loops to detect arrays, and synthesizers need arrays to make effective use of FPGA smart buffers, which increase data reuse and thus reduce time-consuming memory accesses [1]. Loop rerolling also significantly reduces circuit synthesis time by shrinking the control/data flow graph. Operator strength promotion detects strength-reduced operations: sequences of weak operations such as shifts and adds are replaced by a single, stronger multiplication. The compiler can then use a multiplier, which is a fast functional unit, if one is available on the FPGA.
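Operator strength promotion can be illustrated on a tiny invented IR; the tuple format and the `(x << 2) + x` pattern are my assumptions for the sketch, not the actual warp tooling:

```python
# Illustrative sketch of operator strength promotion: a shift-and-add
# sequence computing x*5 as (x << 2) + x is collapsed back into a single
# multiplication, which a synthesizer can map to a fast FPGA multiplier.

def promote(ir):
    """ir: list of (dst, op, a, b) tuples. Rewrite t = (x << c); d = t + x
    into d = x * (2**c + 1)."""
    out, i = [], 0
    while i < len(ir):
        if (i + 1 < len(ir)
                and ir[i][1] == 'shl' and ir[i + 1][1] == 'add'
                and ir[i + 1][2] == ir[i][0]        # add uses the shift result
                and ir[i + 1][3] == ir[i][2]):      # ... plus the shifted value
            dst, _, x, c = ir[i]
            out.append((ir[i + 1][0], 'mul', x, (1 << c) + 1))
            i += 2
        else:
            out.append(ir[i])
            i += 1
    return out

def run(ir, env):
    """Evaluate the toy IR to check the rewrite preserves semantics."""
    ops = {'shl': lambda a, b: a << b, 'add': lambda a, b: a + b,
           'mul': lambda a, b: a * b}
    for dst, op, a, b in ir:
        av = env[a] if isinstance(a, str) else a
        bv = env[b] if isinstance(b, str) else b
        env[dst] = ops[op](av, bv)
    return env

ir = [('t', 'shl', 'x', 2), ('d', 'add', 't', 'x')]   # d = (x << 2) + x
opt = promote(ir)                                      # d = x * 5
print(opt, run(ir, {'x': 7})['d'], run(opt, {'x': 7})['d'])
```

Both versions compute the same value; the promoted form is shorter and exposes the multiply to the synthesizer.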
Without these two new decompilation techniques, the binary approach would have yielded 33 percent less average speedup, with a worst case of 65 percent less. Without any decompilation, the binary approach actually yielded an average slowdown (not a speedup) of 4x [1]. By using warp processors, we can improve performance and energy efficiency for embedded applications. Warp processors are well suited for embedded systems that execute the same application repeatedly for extended periods, and for systems in which software updates and backward compatibility are essential. These processors are extremely useful and efficient for data-intensive applications such as image/video processing, scientific research, or even games.
2.2 Dynamic CAD
The FPGA CAD tasks, shown in Figure 2, include:
Warp processing
-Decompilation
-Behavioral synthesis: converting a control/data flow graph to a datapath and register transfers
-Register-transfer synthesis: converting register transfers to logic
-Logic synthesis: minimizing logic
-Technology mapping: mapping logic to FPGA-compatible resources
-Placement: placing logic/compute resources within specific FPGA resources
-Routing: creating connections between logic/compute resources
Traditional desktop tools that perform the same tasks have long execution times, ranging from minutes to hours, require large memory resources, sometimes more than 50 megabytes, and can comprise hundreds of thousands of lines of source code. The on-chip CAD algorithms, in contrast, must provide very fast execution times, use only small instruction and data memory resources, minimize the amount of data memory used during execution, and still deliver excellent results. Our on-chip CAD tool starts with the software binary; the decompilation step converts the software loops into a high-level representation that is more suitable for synthesis. First, each assembly instruction is converted into equivalent register transfers, which provides an instruction-set-independent representation of the binary. After converting the instructions into register transfers, the decompilation tool builds a control flow graph for the software region and then generates a data flow graph by parsing the semantic strings of each register transfer. The parser uses definition-use and use-definition analysis to build the data flow graph by combining the register transfer trees. Once the control and data flow graphs have been generated, decompilation applies standard compiler optimizations to remove the overhead introduced by the assembly code and instruction set. The next step is to recover high-level constructs such as loops and if statements from the control/data flow graph. After all these steps, the on-chip CAD tool performs partitioning to decide which of the critical software kernels identified by the on-chip profiler are most suitable for implementation in hardware, maximizing speedup while reducing energy. In behavioral and register-transfer synthesis, our dynamic CAD converts the control/data flow graph of each critical kernel into a hardware circuit description. The next job is logic synthesis, which optimizes the hardware circuit. The core of the logic synthesis algorithm is an efficient two-level logic minimizer that is 15x faster and uses 3x less memory than Espresso-II; the trade-off is a two percent increase in circuit size [1].
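The definition-use step described above can be sketched as follows; the register-transfer string syntax is invented for illustration:

```python
# Sketch of def-use data-flow-graph construction: each register transfer
# becomes a node, and an edge is drawn from the most recent definition of
# a register to every later transfer that uses it.

import re

def build_dfg(transfers):
    """transfers: list of strings like 'r3 = r1 + r2'.
    Returns edges (producer_index, consumer_index)."""
    last_def = {}          # register -> index of transfer that last wrote it
    edges = []
    for i, t in enumerate(transfers):
        dst, rhs = [s.strip() for s in t.split('=', 1)]
        # dict.fromkeys dedupes uses while keeping their order
        for reg in dict.fromkeys(re.findall(r'r\d+', rhs)):
            if reg in last_def:                    # use-definition link
                edges.append((last_def[reg], i))
        last_def[dst] = i                          # this transfer defines dst
    return edges

rts = ['r1 = r0 + 4', 'r2 = r1 * r1', 'r1 = r2 - r0', 'r3 = r1 + r2']
print(build_dfg(rts))   # [(0, 1), (1, 2), (2, 3), (1, 3)]
```

Note that the redefinition of r1 in the third transfer correctly redirects the last use to the newest definition, which is what makes the resulting graph suitable for synthesis.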
After this step, the CAD tool performs technology mapping to map the hardware circuit onto the configurable logic blocks (CLBs) and lookup tables (LUTs) of the configurable logic fabric; our technology mapper uses a hierarchical, bottom-up graph-clustering algorithm. After mapping the hardware circuit onto a network of CLBs, the on-chip CAD tool places the CLB nodes onto the configurable logic. The most compute- and memory-intensive FPGA CAD task is routing: typically a tool reroutes a circuit many times until it finds a valid or sufficiently optimized routing, which requires large amounts of memory for updating and restoring the routing resource graph, as well as long execution times. We reduced execution time and memory use by developing a fast, lean routing algorithm and by designing a CAD-oriented FPGA fabric [4].
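Routing's cost can be illustrated with a toy breadth-first router; real routers (including the authors' lean router) iterate and negotiate congestion, so this is only a sketch under my own simplifications:

```python
# Toy illustration of FPGA routing: breadth-first search of one net
# through a routing grid, avoiding wires already claimed by earlier nets.
# Greedy and single-pass; no rip-up or congestion negotiation.

from collections import deque

def route(src, dst, used, w, h):
    """BFS shortest path on a w x h routing grid from src to dst,
    avoiding cells already claimed by previously routed nets."""
    prev = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:                      # reached the sink: backtrack
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < w and 0 <= nxt[1] < h
                    and nxt not in used and nxt not in prev):
                prev[nxt] = node
                frontier.append(nxt)
    return None                              # unroutable without rip-up

used = set()
for src, dst in [((0, 0), (3, 0)), ((0, 1), (3, 1))]:
    path = route(src, dst, used, 4, 4)
    used.update(path)                        # claim the wires of this net
    print(src, dst, path)
```

Even this toy shows why routing dominates memory use: the search touches a large fraction of the routing-resource graph per net.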
2.3 Warp processing scenarios
Figure 3: Warp processing scenarios
There are two different scenarios, depending on application runtime. Figure 3a shows the execution of a short-running application. In this case, running the dynamic CAD tools takes more time than the application itself, so the first few executions see no speedup from warp processing; the application can nevertheless benefit, as long as the warp processor remembers the application's hardware configuration. Figure 3b depicts longer-running applications, such as scientific computing, whose executions take hours or even days. In this case, profiling and dynamic CAD finish some time before the end of the first execution, and the rest of the application can benefit from warped execution. The difference between the two scenarios is therefore that short-running applications are mapped only after several executions, by saving and then reusing the application's FPGA configuration, whereas longer-running applications can be warped even during a single execution; saving the FPGA configuration is not required, although the application can still use a saved configuration for future executions.
3 Single-threaded Applications
Each program has one or more paths of execution. A program with only one path of execution is called single-threaded, and one with two or more paths is called multi-threaded. A single-threaded program can execute only one task at a time and must finish each task in sequence before starting the next one. Depending on the demands, single-threaded programs sometimes work perfectly well; however, the need to accomplish multiple simultaneous tasks sometimes leads to the use of multiple threads.

Thread warping can improve the performance of a multiprocessor by speeding up individual threads and by executing more threads concurrently.
We followed the results of many experiments on single-threaded benchmark applications. Warp processing does not provide speedup for all of them, so we consider only those amenable to speedup using FPGAs; the others would need to be rewritten, or new decompilation techniques would have to be developed. On the other hand, warp processing cannot cause a slowdown: if it cannot speed up the application, the binary updater lets the binary execute on the microprocessor alone. Our present warp FPGA fabric supports approximately 50,000 equivalent logic gates, roughly equal in logic capacity to a small Xilinx Spartan-3 FPGA [1].
In the current architecture, communication between the microprocessor and the FPGA is implemented using a combination of shared memory, memory-mapped communication, and interrupts. Like the data address generators in digital signal processors (DSPs), the FPGA uses address generators to stream the data required by the FPGA circuit from memory. The microprocessor uses interrupts to become aware of hardware completion and uses memory-mapped communication to initialize and enable the FPGA. A single data transfer between the microprocessor and the FPGA requires at least one and at most two cycles.
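The handshake can be modelled in software; every register address and the "circuit" behavior below are hypothetical, serving only to show the initialize/enable-then-interrupt protocol:

```python
# Toy model of microprocessor<->FPGA communication: shared memory for bulk
# data, memory-mapped registers to configure and start the FPGA, and an
# interrupt (modelled as a callback) signalling hardware completion.

FPGA_ENABLE = 0xFFFF0000   # hypothetical memory-mapped control register
FPGA_ARG    = 0xFFFF0004   # hypothetical argument register

class System:
    def __init__(self):
        self.mmio = {}               # memory-mapped register file
        self.shared = [0] * 16       # shared memory region
        self.done = False

    def irq_handler(self):           # interrupt: hardware completion
        self.done = True

    def write(self, addr, value):    # a memory-mapped store
        self.mmio[addr] = value
        if addr == FPGA_ENABLE and value == 1:
            self.fpga_run()

    def fpga_run(self):
        # The "circuit": stream shared memory, accumulate, raise interrupt.
        n = self.mmio[FPGA_ARG]
        self.shared[0] = sum(self.shared[1:1 + n])
        self.irq_handler()

system = System()
system.shared[1:5] = [1, 2, 3, 4]
system.write(FPGA_ARG, 4)      # initialize the FPGA via a memory-mapped store
system.write(FPGA_ENABLE, 1)   # enable it; completion arrives as an interrupt
print(system.done, system.shared[0])   # True 10
```

The pattern mirrors the text: bulk data flows through shared memory, control flows through memory-mapped stores, and completion flows back through an interrupt.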
Comparing a DSP to a warp processor shows that a DSP, like warp processing, uses arithmetic-level parallelism to improve performance, but warp processing is usually faster, although there are some benchmarks for which the DSP is slightly faster. A DSP can execute only a few operations in parallel, whereas warp processing supports a wider range of parallelism. Cases with little parallelism are faster on the DSP because of its higher clock frequency.
4 Multi-threaded Applications
Thread warping is a dynamic optimization technique that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on an FPGA. In modern processing architectures, multicore devices are connected on boards or backplanes to build large multiprocessor systems. A single-threaded program contains only one execution sequence, but there can be more execution paths as well. The first step is therefore to create threads that execute a function f(). If there are not enough processors for the number of threads (step 1), the OS puts the waiting threads in a queue until a processor becomes available (step 2). Our framework analyzes the waiting threads and invokes the on-chip CAD tools, which create custom accelerator circuits for f() (step 3). The CAD takes 32 minutes to finish mapping the accelerators onto the FPGA. If the application has not finished by then, the operating system schedules threads onto the accelerators and microprocessors, exploiting thread-level and fine-grained parallelism.
Thread warping hides the FPGA by dynamically synthesizing accelerators, allowing software developers to take advantage of the performance improvements of custom circuits without any changes to the tool flow, just as multi-threaded programs make use of additional processors without rewriting or recompiling code [3]. At different points during execution, thread warping can create different accelerator versions according to the amount of FPGA resources available.

Figure 4: (a) On-chip CAD tool flow, (b) accelerator synthesis tool flow
4.1 On-chip CAD tools
Figure 4 shows the on-chip CAD tool flow, which first analyzes the thread queue and then creates custom accelerators for the waiting threads using the accelerator synthesis tool flow. We first need to define some terms. A thread creator is a function that contains an application programming interface (API) call that creates threads. A thread is the unit of execution that the operating system schedules. A thread group is a collection of threads, created from the same instruction address, that share input data. A thread function is the function that a thread executes.
As we can see in Figure 4a, queue analysis determines the union of waiting thread functions, and thread counts gives the number of occurrences of each thread function in the queue. If an accelerator has not been created before, accelerator synthesis creates a custom circuit for each thread function and puts it in the accelerator library. Accelerator synthesis also updates the software binary so that the microprocessor can communicate with the created accelerators. Specifying the number of accelerators of each thread function to place in the FPGA is the responsibility of accelerator instantiation. The output of this step is converted to an FPGA bitstream by the place and route tool. The schedulable resource list (SRL) holds the available processing resources in order to inform the operating system about them. The thread queue has a limited size; if the number of threads reaches the predefined size, the OS invokes the on-chip CAD. As mentioned before, accelerator synthesis creates a new accelerator when a new thread function arrives whose accelerator does not yet exist in the library. Then, because of the change in thread counts, accelerator instantiation changes the type and number of accelerators in the FPGA.
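The queue-analysis and library-lookup steps might look like the following sketch; the data shapes, placeholder netlists, and proportional instantiation policy are my assumptions, not the paper's algorithm:

```python
# Sketch of queue analysis: compute the set of waiting thread functions
# and their counts, synthesize an accelerator only for functions not yet
# in the library, and apportion FPGA area by demand.

from collections import Counter

accel_library = {}                 # thread function -> accelerator netlist

def on_chip_cad(thread_queue):
    counts = Counter(t['func'] for t in thread_queue)   # thread counts
    for func in counts:                                 # union of functions
        if func not in accel_library:                   # synthesize once
            accel_library[func] = f'netlist<{func}>'    # placeholder circuit
    # accelerator instantiation: share FPGA area in proportion to demand
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

queue = [{'func': 'f'}] * 6 + [{'func': 'g'}] * 2
share = on_chip_cad(queue)
print(sorted(accel_library), share)   # ['f', 'g'] {'f': 0.75, 'g': 0.25}
```

Invoking `on_chip_cad` again with the same queue leaves the library unchanged, mirroring the text's point that a new accelerator is synthesized only for previously unseen thread functions.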
Figure 4b shows the tool flow of accelerator synthesis, which starts with decompilation and hardware/software partitioning. Then memory access synchronization analyzes the thread functions, detects threads with similar memory access patterns, and combines them into thread groups that share memory channels and execute synchronously. High-level synthesis converts the decompiled representation of each thread function into a custom circuit, represented as a netlist. If the entire thread function cannot be implemented on the FPGA, the binary updater modifies the software binary so that the software can communicate with the accelerators.
With parallel access, multiple threads can read the same data from memory. Thus, memory access synchronization (MAS) can combine memory accesses from multiple accelerators onto a single channel and use a single read to service many accelerators. MAS unrolls loops to generate fixed-address reads in the control/data flow graph of each thread function.
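The read-combining idea can be sketched as a set union over the fixed-address reads that unrolling exposes; the overlapping read windows below are an invented example:

```python
# Sketch of MAS read combining: once loop unrolling has fixed each
# accelerator's read addresses, reads shared by several accelerators can
# be serviced by a single memory fetch on the shared channel.

def combine_reads(read_sets):
    """read_sets: list of per-accelerator sets of fixed read addresses.
    Returns the single fetch schedule and the saving versus naive reads."""
    fetch_schedule = sorted(set().union(*read_sets))    # one read per address
    naive = sum(len(s) for s in read_sets)
    return fetch_schedule, naive - len(fetch_schedule)

# Four accelerator copies reading overlapping 4-word windows of an array.
reads = [set(range(i, i + 4)) for i in range(4)]  # {0..3},{1..4},{2..5},{3..6}
schedule, saved = combine_reads(reads)
print(schedule, saved)   # [0, 1, 2, 3, 4, 5, 6] 9
```

Here 16 naive reads collapse to 7 fetches, which is the data-reuse effect the text attributes to combining accesses onto a single channel.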
The OS gives priority to the fastest resource compatible with the thread function, which is usually an accelerator. However, when a thread function contains other calls (such as create, join, mutex, or semaphore functions), the OS schedules that thread onto a microprocessor. In some cases no microprocessor or accelerator is available for the first thread in the queue, but other threads in the queue may have available accelerators. The problem is that when the head of the queue cannot be scheduled, the other threads cannot be scheduled either, even though they have available accelerators. To avoid this problem, the scheduler scans the thread queue until it finds a thread that can be scheduled. If no resource is available, or the available resources do not apply to any waiting thread, the scheduler avoids the worst case by not scanning the queue. The scheduler is invoked when a thread is created or completed, when a lock is released, and when a synchronization request blocks a software thread.
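A minimal sketch of this scheduling policy, with invented thread and resource shapes (`plain` marks a thread function free of other pthread calls):

```python
# Sketch of the scheduler: prefer the fastest compatible resource (an
# accelerator), fall back to a microprocessor, and scan past a queue head
# that cannot currently be scheduled.

def pick(queue, free_accels, free_cpus):
    """free_accels: dict func -> number of idle accelerators for it.
    Returns (queue_index, resource) or None if nothing is schedulable."""
    for i, t in enumerate(queue):
        # Threads using other pthread calls (join, mutex, ...) must run
        # in software on a microprocessor.
        if not t['plain'] and free_cpus > 0:
            return i, 'cpu'
        if t['plain'] and free_accels.get(t['func'], 0) > 0:
            return i, 'accel'      # fastest compatible resource
        if t['plain'] and free_cpus > 0:
            return i, 'cpu'
    return None                    # nothing schedulable right now

queue = [{'func': 'f', 'plain': True},    # head: no accelerator, no CPU free
         {'func': 'g', 'plain': True}]    # but g has an idle accelerator
print(pick(queue, free_accels={'g': 1}, free_cpus=0))  # (1, 'accel')
```

The scan past the head is exactly the fix described above: the thread for g is dispatched to its accelerator even though the thread for f is stuck at the front of the queue.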
To evaluate the performance of the framework, the authors developed a C++ simulator that creates a parallel execution graph (PEG). Nodes in this graph represent sequential execution blocks (SEBs), which are blocks that end with a pthread call or with the end of a thread. (Pthreads define a set of C programming language types, functions, and constants.) The edges of the graph represent the synchronization between SEBs.
5 Conclusion
FPGAs can benefit a wide range of applications, such as video and audio processing, encryption and decryption, encoding, compression and decompression, bioinformatics, and anything that requires intensive computing on large streams of data. We studied much research and various experiments and showed that the basic concept of warp processing, namely dynamically mapping software kernels to an on-chip FPGA to improve performance and energy efficiency, is feasible. The simplicity of the W-FPGA's configurable logic fabric lets us achieve lower power consumption and higher execution frequencies than a traditional FPGA for the applications considered. The benefits of warp processing were most apparent for applications with much concurrency. For multi-threaded warping we need additional CAD tools that determine which and how many threads to synthesize.
References
[1] Frank Vahid, Greg Stitt, Roman Lysecky, "Warp Processing: Dynamic Translation of Binaries to FPGA Circuits", IEEE Computer Society, 2008
[2] Roman Lysecky, Greg Stitt, Frank Vahid, "Warp Processors", ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006
[3] Frank Vahid, Greg Stitt, "Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators", ACM, 2007
[4] Frank Vahid, Roman Lysecky, S. Tan, "Dynamic FPGA Routing for Just-in-Time Compilation", IEEE/ACM, 2004
[5] http://www.en.wikipedia.org
[6] http://www.cs.ucr.edu
Performance Modeling of Embedded Applications
with Zero Architectural Knowledge
University of Paderborn
Pavithra Rajendran
January 4, 2012
Abstract
Performance evaluation is a key phase in the design and development of embedded systems. Modern embedded systems have short product development life cycles, so it is essential to build a performance model early in the design phase so that rework can be minimized. Most performance estimation techniques require knowledge of the system architecture if they are to be applied during the design phase; unfortunately, not all target architecture information is available that early.

The objective of this paper is to present a model by Marco Lattuada and Fabrizio Ferrandi that estimates performance without requiring any information about the processor architecture other than the GNU GCC intermediate representation, and to compare it against other similar models. The model applies linear regression to the internal register-level representation of the GNU GCC compiler so that compiler optimizations are exploited. The paper also briefly describes my ideas on how the model can be extended to evaluate the performance of modern embedded systems, which are highly complex, with advanced architectural features such as branching, pipelining, streaming, buffer caches, and power management that cannot be efficiently captured by linear methods.
1 INTRODUCTION
Early performance evaluation in design and minimal architectural dependency are primary criteria for modern embedded systems. Flexibility, time-to-market, and cost requirements form an integral part of the development cycle, and they can only be met by early performance evaluation. Fixing timing-related constraints later in the development cycle costs more, as it may cause rework in design and development. This complexity demands a new model that can evaluate performance with minimal architectural knowledge. The increased use of Multi-Processor System-on-Chip (MPSoC) designs in embedded systems has further complicated evaluation, because the multiple components and their heterogeneity demand architectural knowledge. Performance estimation should therefore be done early in the design phase, so that alternative solutions can be compared without knowing all the details of the components that will be used later in product development. Results of similar work show that early evaluation techniques [5] aptly fit the modern time-to-market pressure and the short product life dictated by market competition. But modern embedded systems are real-time and more complex. For example, a modern real-time embedded system may run a multimedia application that has to encode or decode a stream at high speed without compromising quality. Performance with quality is the key for time-critical embedded systems: for a monitoring device used in a nuclear power plant, or a device that monitors forest fires, missing a deadline or a time-critical decision can cause severe damage. Moreover, these systems are developed under huge market competition and must be produced at low cost; they have to be reliable but at the same time show competitive performance. The proposed methodology does not require any knowledge of the target processor; instead, the system design exploits the information about the target processor provided by the GNU GCC compiler.
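The regression idea at the heart of the methodology, fitting per-operation costs to measured execution times, can be sketched with ordinary least squares; the operation names, counts, and costs below are invented, and the real model works on GCC's internal representation rather than this toy feature vector:

```python
# Sketch of linear-regression performance modeling: execution time is
# modeled as a linear combination of intermediate-representation
# operation counts, with coefficients fitted by least squares.

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    n = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(n)]
    for col in range(n):                       # forward elimination w/ pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j]
                              for j in range(i + 1, n))) / A[i][i]
    return coef

# Rows: per-benchmark counts of (add, mul, load) operations; y: cycles.
X = [[10, 2, 5], [4, 8, 1], [7, 3, 9], [1, 1, 2]]
true_cost = [1.0, 3.0, 4.0]                    # hypothetical cycles per op
y = [sum(c * x for c, x in zip(true_cost, row)) for row in X]
coef = least_squares(X, y)
print([round(c, 6) for c in coef])             # ~[1.0, 3.0, 4.0]
```

With more benchmarks than operation classes the system is overdetermined, and the fitted coefficients act as architecture-independent "costs" per IR operation, which is what lets the approach avoid detailed processor knowledge.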
The remainder of this paper is organized as follows. Section 2 compares other similar works. Section 3 describes the methodology proposed by Lattuada. Section 4 compares the experimental results of similar models. Section 5 describes enhancements that can be made to the methodology for modern embedded systems. Section 6 concludes the paper.
2 COMPARISON OF RELATED WORK
Generic methods for performance evaluation can be categorized as:
1. Direct measures.
2. Estimation by simulation.
3. Estimation using a mathematical model.
4. Prediction.
Most of the time, direct measurement requires developers to have accurate knowledge of the target architecture. This is not possible, because not all components are available early in the design phase, and the components are prone to change later in the design due to cost, new chip technology, or other factors. So this model cannot be fully utilized early in the design phase, and techniques based on simulation are preferred. In the simulation methodology, each component can be simulated by running a behavioral simulator model using MATLAB or a neural network. The advantage of the simulation model is its accuracy; at the same time, it can be applied only to smaller components and cannot be generalized to a bigger set, since the simulated behavior could change. This disadvantage leads to the third approach, based on mathematical models. Here an estimate is derived by correlating numerical functions with the performance of the component; this is less accurate but at the same time much faster. A prediction model can be based on simulation results or on a profiling study. A simulation-based predictive model retains the limitations of the simulation model, while a profile-based study requires the designer to know the architecture of the target system.
2.1 Direct Estimation Model
Direct measures for performance evaluation require deep knowledge of the architectural characteristics of the target system to be designed.
Brandolese et al. [1] presented a model that divides the source code into basic elements called atoms, which are used in a hierarchical analysis of performance. In this model, the performance estimate is computed by summing the execution times of all the atoms plus different overhead scenarios in the system. The execution time of each atom is the time taken to execute a particular program path under ideal conditions plus a deviation factor derived from a mathematical model. The disadvantages of this model are that reference times and deviations cannot be mapped linearly, and that estimating the execution time and its deviation becomes increasingly complex for a larger system. Also, this model does not consider target architecture characteristics such as parallelism, memory, etc.
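In generic form (my notation, reconstructed from the description above rather than taken from [1]), the atom-based estimate is a sum of per-atom reference times and deviations plus system-level overheads:

```latex
% T_est   : total estimated execution time
% t_i^ref : ideal-path execution time of atom i
% delta_i : deviation factor of atom i (from the mathematical model)
% O_sys   : overhead scenarios of the system
T_{\mathrm{est}} \;=\; \sum_{i \in \mathrm{atoms}} \left( t_i^{\mathrm{ref}} + \delta_i \right) \;+\; O_{\mathrm{sys}}
```

The criticisms in the text then map onto the terms: the delta_i are not linearly related to the reference times, and both grow hard to estimate as the number of atoms increases.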
To overcome this disadvantage, Beltrame, Brandolese et al. [4] proposed a subsequent, more flexible model. It derives the performance estimate by summing the execution delay of an operation, the overhead due to deviations, and a coefficient factor that accounts for target system performance characteristics such as parallelism. The problem with this model is that it does not consider the heterogeneity of the target system, which may use multiple processors.
Hwang et al. [12] proposed a model that considers pipelining, branch delays, and memory organization, but it still requires the exact timings for executing the different basic blocks on the different processors.
Most direct estimation techniques share the same disadvantage: they require the designer to have some knowledge of the architecture of the target component to guarantee accuracy. This requirement was affordable when the designer dealt with a single processing element or a few of them, but with MPSoCs in modern real-time embedded systems it is no longer a realistic approach.
2.2 Simulation and Mathematical Model
Performance techniques based on automation like simulation or ma<strong>the</strong>matical models<br />
are faster and more accurate than direct estimation. They can easily apply multiprocessor<br />
characteristic to figure performance evaluation on memory access and parallelism.<br />
Question is how much degree <strong>of</strong> target system architecture should be known by <strong>the</strong><br />
designer.<br />
Lajolo et al. [6] used a mathematical model with the GNU GCC compiler to generate assembler-level C code with timing annotations. This can provide very accurate and fast estimates. The disadvantage of the model is that regenerating the C code for the target system requires understanding the target architecture, or at least the instruction set of the target processor.
Oyamada et al. [7] propose a simulation-based model that is likewise based on the instruction set of the target processor but follows a non-linear approach based on neural networks. Using a neural network makes the model more accurate and faster, but it complicates the estimation if the developer wants to break the code into subparts.
University of Paderborn
Pavithra Rajendran
2.3 Prediction Model
Prediction techniques are also used in performance estimation. Suzuki et al. [10] used a prediction model that considers a set of benchmark execution times and average cycle counts to determine the performance of the system. The drawback of this model is that it does not consider overheads, loops, or recursion. Giusto et al. [9] came out with a similar model but with a linear approach, which can be applied to similar application execution paths without even estimating them. Moreover, these prediction models do not consider architectural features such as parallelism, pipelining, compiler optimization, etc. Above all, they lack accuracy when applied blindly across different processors.
In summary:
1. Direct evaluation model: cannot be used effectively, as most of the target components will not be available during the performance evaluation design phase.
2. Simulation model: requires knowledge of the target architecture for accuracy.
3. Mathematical model: linear and additive in nature, but the deviations are higher.
4. Prediction model: lacks accuracy.
3 PROPOSED METHODOLOGY - Marco Lattuada and Fabrizio Ferrandi [2]
A comparison of all the related work shows that a performance estimation model is needed which:
(a) considers the possible characteristics of the target processors, but without requiring knowledge of the architecture itself or of its instruction set, and is hence extensible;
(b) considers target architecture characteristics like compile-time optimizations, pipelining, parallelism, etc.;
(c) is linear, so that every component can be analyzed individually;
(d) takes into account the dynamic behavior of the application to find correlations among source code, input data, and performance.
3.1 Linear Regression Technique
In mathematical notation, a linear regression model is of the form:
Y = f(X, β, ɛ) (1)
where Y is the execution time of the model (or of a subset of it) and is the dependent variable, X is the vector of source code parameters (the independent variables), β is the vector of coefficients for those parameters, and ɛ is the error term.
Performance Modeling of Embedded Applications with Zero Architectural Knowledge
Expanding the function, it can be written as
Y = β0 + β1X1 + β2X2 + ... + βkXk + ɛ (2)
This can be simplified to
Execution time/Cycle time = β0 + Σi∈F βi·xi (3)
The linear regression technique can be divided into two steps: model building and model application. During model building we measure benchmark execution times and develop and tune the characteristics, which we can call training sets, as in the simulation model. This is usually done by running a profiler such as IPROF on the target system, or by generating neural networks or simulators in MATLAB or similar simulation tools. During the latter step, we apply the analyzed factors to another subset of the application and arrive at the execution time directly.
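As an illustration of the two steps, the least-squares fit behind equations (2) and (3) can be sketched as follows. All benchmark names, sequence counts, and cycle figures below are invented for illustration; the real methodology derives its features from profiled RTL sequences.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_linear_model(X, y):
    """Model building: least-squares fit of y = b0 + sum_i bi*xi
    via the normal equations (A^T A) beta = A^T y."""
    A = [[1.0] + [float(v) for v in row] for row in X]  # intercept column
    n = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]
    Aty = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(n)]
    return solve(AtA, Aty)

def predict(beta, counts):
    """Model application: plug new sequence counts into the fitted model."""
    return beta[0] + sum(b * c for b, c in zip(beta[1:], counts))

# Invented training set: per-benchmark counts of three RTL sequence
# classes, and the measured cycle totals for each benchmark.
X = [[120, 30, 10], [200, 80, 25], [50, 10, 5], [300, 60, 40], [90, 20, 8]]
y = [1500.0, 3100.0, 620.0, 4100.0, 1150.0]
beta = fit_linear_model(X, y)
estimated = predict(beta, [150, 40, 15])  # unseen code fragment
```

Because the model is linear and additive, the fitted coefficients can be applied to any subset of an application by counting its sequences, which is exactly what makes component-wise analysis possible.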
3.2 Model Description
The proposed model basically consists of the following major steps:
1. Convert the source code into a language-independent intermediate representation called GIMPLE.
2. Perform the target-independent optimizations.
3. Translate the GIMPLE representation into the RTL (Register Transfer Language) representation.
4. Perform the target-dependent optimizations.
5. Convert the RTL representation into assembly language.
Each RTL instruction is composed of a combination of RTL operations: an RTL operation is mainly characterized by an operator (e.g., plus, minus), a data type (e.g., SI, single integer), some operands (e.g., registers, results of other RTL operations), and annotations.
For example, as illustrated in Figure 2 and Figure 3, an RTL instruction can be composed of a set operation that writes into a register (reg) the result of a PLUS operation executed on a register and on a constant integer.
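To make the notion of operator/type classes concrete, the sketch below extracts operator:mode pairs from a GCC-style RTL expression matching the set/PLUS example just described. The RTL line follows GCC's s-expression syntax, but the parsing is a simplified illustration, not GCC's actual reader, and the operator list is an assumption.

```python
import re

# A GCC-style RTL instruction: set a register to (plus reg const_int),
# matching the example described in the text.
rtl = "(set (reg:SI 60) (plus:SI (reg:SI 61) (const_int 4)))"

# Collect operator:mode pairs such as plus:SI. 'reg' is an operand, not
# an operator, so this sketch keeps only a small assumed operator set.
OPERATORS = {"plus", "minus", "mult", "ashift"}
pairs = [m for m in re.findall(r"([a-z_]+):(SI|DI|SF|DF)", rtl)
         if m[0] in OPERATORS]
print(pairs)  # [('plus', 'SI')]
```

Counting such pairs over a whole function yields the sequence-class statistics that feed the regression model.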
The RTL-sequence-based analysis meets the requirements listed in the previous section for the following reasons:
1. The RTL representations of the same application differ for different target processors. This is because code generation in the GNU GCC compiler considers the characteristics of the target architecture; hence it captures target performance characteristics like compiler optimization, pipelining, and memory hierarchy.
Figure 1: Lattuada and Ferrandi’s Model
2. The RTL language itself is target independent: the same constructs are used when generating assembly code for any target processor system.
3. Target-independent optimizations have already been performed, because the code is generated after the compiler middle end.
4. Portions of the target application can be analyzed independently.
Figure 2: C Code and GIMPLE
5. Profiling can be done on the target machine and coupled with the RTL representation.
3.3 Model Building
The proposed model consists of three preprocessing steps that are performed before the linear regression: normalization, main introduction, and clustering.
Normalization is applied for accuracy. Estimation techniques usually consider the overall execution delay without considering either the magnitude of the input or the size of the application. An absolute error or deviation cannot provide accurate information; hence the relative error must be considered. This is achieved through normalization in the proposed model, where:
Input: for each RTL sequence class, the fraction of the sequences of the application that belong to that class, relative to the whole application.
Output: the average number of cycles required by an RTL sequence of that application; the range of this new dependent variable is less sensitive than that of the original one.
These values are easily calculated by dividing the number of occurrences of a sequence by the overall count. For example, the normalized value of the operation ashift:SI-plus:SI is 1/11 ≈ 0.09, obtained by dividing its single occurrence by the overall count of eleven sequences.
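The normalization step can be sketched in a few lines. The list of sequence classes below is invented for illustration (eleven sequences in total, one of which is the ashift:SI-plus:SI class from the example above):

```python
from collections import Counter

# Hypothetical RTL sequence classes observed in an application:
# 11 sequences in total, one of them ashift:SI-plus:SI.
sequences = (["plus:SI-plus:SI"] * 4 + ["set:SI"] * 3 +
             ["mult:SI-plus:SI"] * 3 + ["ashift:SI-plus:SI"])

counts = Counter(sequences)
total = sum(counts.values())

# Normalization: each class becomes its fraction of all sequences.
normalized = {cls: n / total for cls, n in counts.items()}
print(round(normalized["ashift:SI-plus:SI"], 2))  # 0.09
```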
Simulation normally does not consider the startup time of the application itself or the function call overhead. The model compensates for this by introducing a fake operation called Main introduction, which can be treated as a constant term.
Last comes clustering, where similar RTL sequences are grouped. In a large application there may be millions of RTL sequences. Their number can be minimized by defining an equivalence relation among <op:type> classes. This relation should describe which operations can be considered performance-equivalent: for example, plus and minus, less-than and greater-than, or the same operation on similar types of data should possess the same execution time. This reduces the number of training sets and hence simplifies the model.
Figure 3: RTL Representation and Assembly Language
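A minimal sketch of such performance-equivalence clustering follows. The groupings mirror the examples in the text (plus/minus, comparisons), but the cluster names and the mapping table are invented for illustration:

```python
# Map each <op:type> class to a representative cluster; classes in the
# same cluster are assumed to take the same execution time. The table
# and cluster names are hypothetical.
EQUIVALENT = {
    "plus:SI": "additive:SI", "minus:SI": "additive:SI",
    "lt:SI": "compare:SI", "gt:SI": "compare:SI",
}

def cluster(op_type):
    """Return the performance-equivalence cluster of an <op:type> class."""
    return EQUIVALENT.get(op_type, op_type)  # unknown classes stay as-is

sequence = ["plus:SI", "minus:SI", "lt:SI", "mult:SI"]
print(sorted({cluster(s) for s in sequence}))  # fewer distinct classes
```

Fewer distinct classes means fewer regression coefficients to train, which is exactly why clustering simplifies the model.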
3.4 Model Application
Once the analysis and model building are done, the linear formula explained in Section 3.1 is applied. The basic execution cycle time is calculated first, and repeated cycles are executed to calculate the deviations.
4 COMPARISON OF EXPERIMENTAL RESULTS
Compared to the other models of Section 2, the proposed RTL methodology exploits the linear regression technique as follows:
1. It is more accurate on heterogeneous systems than [9], as it converts the source code only into RTL form and regenerates assembler code irrespective of the target architecture. RTL generation also makes use of the target compiler's optimization features.
2. The average error deviation obtained by model [10] is 6.03%, the lowest among the compared models, but it can only be applied to simple applications without loops, recursion, etc.
3. Most linear models described in Section 2 exhibit errors ranging from 0.06% to 19.3%, and non-linear models from 0.03% to 20.5%. The deviation is minimal if the architecture is known and the input data is unknown. The error of the RTL linear model, in contrast, does not depend on the architecture and shows an 8.6% deviation in the worst case.
4. Lajolo's model [6] exhibits the least deviation, less than 4%, but the system requires architectural knowledge to regenerate the code, and the cycle iteration is minimal.
5. Oyamada et al. [7] successfully created a similar model that produced almost the same result, around 10.8% in the worst case. The model works well on heterogeneous systems, but it relies largely on a neural network to train the sets; hence the model is non-linear and is not as simple to extend as the RTL model, which uses clustering.
6. All models based on assembly-level code show better results than RTL and are more accurate, but they require the developer to know the instruction set of the target processor.
5 PROPOSED FUTURE WORK
Lattuada's work, which was reviewed above, offers certain features like the linear regression technique with early evaluation during the design phase. However, it does not consider the evaluation of modern embedded systems, for which the RTL sequence model may produce millions of complex sequences. Creating the training sets would take ages without neural-network support; hence, for complex systems, the approach will start tilting towards non-linear techniques.
The major drawbacks of Lattuada's model are:
1. It does not consider the length of the sequences created by the RTL analysis.
2. Clustering becomes complex for large applications.
3. There is no automated clustering.
C/C++-based models [8] can be executed to simulate the complete behavior of a system and obtain some performance information. Just like testing, these approaches can give good confidence in the correctness of the system, but no formal guarantees on the upper limits of performance. Abstract interpretation models can be used to formally and automatically verify properties such as "the system never takes more than X units of time to process an event". These analyses provide formal guarantees, but the analysis can take a huge amount of time and memory. The approach should therefore be to opt for a model that analyzes the critical components in detail using a modular approach [11] [3] and the less critical components using an abstract translation technique, while at the same time making it easy to create training sets. The above model can be extended as represented in Figure 4. The following are the ideal characteristic steps for a fast and portable performance analysis that needs zero architectural knowledge of the target systems:
Figure 4: Proposed Model
1. Convert the source code into machine-independent virtual code.
2. Cluster the operations using a neural network.
3. Regenerate code for the target architecture.
4. Execute the performance estimation cycle using the trained neural network.
5. Apply the deviation coefficient using dynamic programming.
6. Apply a backtracking algorithm to decide which execution path must be used when estimating real-time applications.
6 CONCLUSION
Early performance estimation is the way to go due to the complexity and heterogeneity of current and future embedded systems. Today's market requires comparing multiple architectures during design time; hence fast and accurate performance estimation tools are needed to support design architecture exploration. The proposed future work is an integrated methodology for faster estimation without architectural knowledge, supported by neural networks. The estimator provides flexibility and precision even for complex processors with pipelines and cache memories. It is fast compared to other linear models and better than non-linear models in the worst-case scenario.
References
[1] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. Source-level execution time estimation of C programs. Pages 98–103, 2001.
[2] Marco Lattuada and Fabrizio Ferrandi. Performance modeling of embedded applications with zero architectural knowledge. Pages 277–286, New York, NY, USA, 2010. ACM.
[3] F. Ferrandi, M. Lattuada, C. Pilato, and A. Tumeo. Performance estimation for task graphs combining sequential path profiling and control dependence regions. In MEMOCODE '09: Proceedings of the 7th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pages 131–140, 2009.
[4] G. Beltrame, C. Brandolese, W. Fornaciari, F. Salice, D. Sciuto, and V. Trianni. Modeling assembly instruction timing in superscalar architectures. In ISSS '02: 15th International Symposium on System Synthesis, 2002.
[5] M. Gries. Methods for evaluating and covering the design space during early design development. Tech. Rep. UCB/ERL M03/32, Electronics Research Lab, University of California at Berkeley, 2003.
[6] M. Lajolo, M. Lazarescu, and A. Sangiovanni-Vincentelli. A compilation-based software estimation scheme for hardware/software co-simulation. In CODES '99: Seventh International Workshop on Hardware/Software Codesign, pages 85–89, 1999.
[7] M. S. Oyamada, F. Zschornack, and F. R. Wagner. Applying neural networks to performance estimation of embedded software. J. Syst. Archit., 54(1-2):224–240, 2008.
[8] Moo-Kyoung Chung, Sangkwon Na, and Chong-Min Kyung. System-level performance analysis of embedded system using behavioral C/C++ model. IEEE INSPEC Accession Number: 8540449, no. 14, pages 188–191, 2005.
[9] P. Giusto, G. Martin, and E. Harcourt. Reliable estimation of execution time of embedded software. In DATE '01: Conference on Design, Automation and Test in Europe, pages 580–589, 2001.
[10] K. Suzuki and A. Sangiovanni-Vincentelli. Efficient software performance estimation methods for hardware/software codesign. In DAC '96: 33rd Design Automation Conference, pages 605–610, 1996.
[11] E. Wandeler, L. Thiele, M. Verhoef, and P. Lieverse. System architecture evaluation using modular performance analysis: A case study. 2006.
[12] Y. Hwang, S. Abdi, and D. Gajski. Cycle-approximate retargetable performance estimation at the transaction level. In DATE '08: Conference on Design, Automation and Test in Europe, pages 38, 2008.
Improving Application Launch Times
Gavin Vaz
University of Paderborn
gavinvaz@mail.uni-paderborn.de
December 2, 2011
Abstract
Application launch times are very noticeable to the user. The user has to wait for the entire application to load, and only then can he interact with it. If this wait is too long, it affects the user's satisfaction. The primary cause of slow application launch times is hard disk latency. This paper looks at how application launch times can be reduced by predicting when an application might be launched and preloading it into main memory in order to reduce disk latencies. It also looks at how hybrid hard disks could be used to reduce application launch times by around 24%, and at optimization techniques that are able to reduce the application launch times of already fast solid state drives by 28%.
1 Introduction
Application launch times are one of the most evident performance parameters from the user's perspective. Waiting for an application to load is not a pleasant experience and reduces user satisfaction. Over the past decade, the computational power of processors and the speed of main memory have been steadily improving. However, the size of applications has also been growing rapidly, resulting in slow application launch times in spite of faster processors and memory.
Youngjin Joo et al. [9] performed a study on application launch times in order to determine how much time was used by the CPU, the memory, the hard disk drive (HDD), and data transfer during an application launch. Their study (see Fig. 1) showed that the CPU and memory accounted for merely 20 to 30 percent of the application launch time; the remaining time was accounted for by the HDD, with disk rotational latency and seek times making up nearly half of the total application launch time.
HDDs are block devices. A block is the smallest addressable unit of a HDD and is addressed using its logical block address (LBA). In order to read a block, the HDD controller must first move the head into position over the appropriate cylinder; the time taken to do this is known as the seek time of the disk. The desired disk block might not be below the head, so the HDD controller must wait for the disk to rotate until the desired block is under the head; this time is known as the rotational latency of the disk. Thus, seek time and rotational latency together constitute the disk latency and are the outcome of the mechanical limitations of a HDD.
Figure 1: Breakdown of an application's launch time [9].
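A rough back-of-the-envelope calculation shows why these mechanical latencies dominate. The drive parameters and block count below are assumed typical values, not figures from the paper:

```python
# Back-of-the-envelope disk latency estimate; all numbers are assumed
# typical values, not measurements from the paper.
RPM = 7200                 # spindle speed of an assumed desktop HDD
AVG_SEEK_MS = 9.0          # assumed average seek time

revolution_ms = 60_000 / RPM            # time for one full rotation
avg_rotational_ms = revolution_ms / 2   # on average, half a rotation
avg_access_ms = AVG_SEEK_MS + avg_rotational_ms

# Launching an application that touches 500 scattered blocks would pay
# this latency repeatedly (ignoring caching and request reordering).
launch_latency_ms = 500 * avg_access_ms
print(round(avg_rotational_ms, 2), round(launch_latency_ms / 1000, 2))
```

With these assumed numbers, the average rotational latency alone is about 4.17 ms, and 500 scattered reads cost several seconds in pure mechanical latency before any data transfer.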
An application (file) is made up of many such blocks, which might not be contiguous and in reality might be distributed across many cylinders of the HDD. In addition, most applications nowadays use shared libraries, which also need to be loaded from the disk when the application is launched. So when an application is launched, hundreds of blocks are requested from the HDD, and a lot of time is wasted purely on seek and rotational latencies. Hence, seek and rotational latencies are the primary and most important cause of slow application launches.
The problem of disk latency has been addressed by many techniques, one of them being disk caches. Disk caches are effective only when the same data is requested (accessed) repeatedly, eliminating seek and rotational latencies by reading the data directly from the cache instead of from the disk. However, in the case of an application launch, unless the application has been launched before, the data that the application requests will not be present in the disk cache, making the cache ineffective in reducing application launch times.
Nowadays, some computer manufacturers provide application developers with “Launch Time Performance Guidelines” [8] that need to be followed in order to improve application performance at launch time. These involve delaying the loading and initialization of subsystems that are not required immediately. This helps to speed up the launch time considerably. However, this approach cannot reduce the latency of loading the code that is absolutely necessary to launch the application.
On the other hand, some applications load a part of their application code into main memory when the operating system boots up. This is done so that the application appears to load faster when the user launches it. In addition to wasting precious main memory, this scheme gives the user the perception that the operating system takes a long time to load, and it does not really reduce the overall application launch time.
Another approach that operating systems commonly employ is to optimize the HDD by reducing file fragmentation on the disk. This is done by periodically defragmenting the HDD, which results in lower seek and rotational latencies, meaning that applications are able to load faster.
Microsoft claims that “Windows ReadyBoot” [4] helps decrease the time required to boot the system by preloading the files required during the booting phase. ReadyBoot saves a file trace of the files used when the system boots up. It then uses idle CPU time to analyze the file traces from the five previous boots, making a note of the accessed files along with their locations on disk. During subsequent system boots, ReadyBoot prefetches these files into an in-RAM cache, saving the boot process the time required to retrieve the files from the disk.
This paper looks at the different approaches that have been employed to tackle the problem of slow application launch times. Section 2 looks at how adaptive prefetching can be used to predict which applications a user might run in the near future and fetch them into main memory in order to achieve faster application launch times. Section 3 looks at how hybrid hard disks (H-HDDs) could be used to improve application launch times. Section 4 looks at how the performance of solid state drives (SSDs) could be further improved to reduce application launch times. Section 5 compares the pricing of HDDs, H-HDDs, and SSDs. We conclude the paper in Section 6.
2 Adaptive Prefetching
Prefetching is a well-known concept and has been used to prefetch instructions for processors, to prefetch data from main memory into the processor cache [15], and to prefetch links on webpages [7], to name a few. This section takes a closer look at Preload [6], an adaptive prefetcher that is capable of predicting when an application might be launched by the user and preloads it into main memory. This helps reduce HDD latencies and hence application launch times.
2.1 Preload
Preload consists of the following two components:
1. Data gathering and model training
2. Predictor
These components are fairly isolated and are connected together by a shared probabilistic model. Data is gathered by monitoring the user's actions and is used to train the model. The predictor uses this model to predict which application will be launched and then prefetches that application.
Typical GUI applications have larger binaries, larger working sets, and longer running times, and are inherently more complex than other Unix programs. The goal of Preload is to achieve faster “application” start-up times. In order to do this, it needs to distinguish between an “application” and any other program. Preload ignores any processes that are very short-lived or whose address space is smaller than a specified size. By ignoring these processes, Preload is able to keep the size of the model down.
The processes running on the system are filtered according to the above criteria to obtain a list of running applications. Information on all these running applications is collected periodically by the data gathering component. The period of this cycle is a configurable parameter and is set to twenty seconds if not explicitly specified. Finally, a list of memory maps is fetched for each application and is used to update the model [6].
The predictor, like the data gathering component, is also invoked periodically. It uses the trained model along with the list of currently running applications to predict which applications should be prefetched. For every application that is not running, the probability of it starting in the next cycle is computed. The predictor then uses these per-application probabilities to assign probabilities to their maps. It then sorts the maps by probability and prefetches the top ones into main memory. In order to minimize the system load due to prefetching, system load and memory statistics are used to decide how much prefetching is performed in each cycle [6].
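The predictor step described above can be sketched as follows. All application names, probabilities, map sizes, and the memory-budget policy are invented for illustration; this is not Preload's actual algorithm or data.

```python
# Hypothetical per-application launch probabilities for applications
# that are currently not running.
launch_prob = {"browser": 0.7, "editor": 0.4, "mailer": 0.1}
app_maps = {  # (memory map name, size in MB) per application; invented
    "browser": [("libhtml.so", 12), ("browser.bin", 30)],
    "editor":  [("editor.bin", 8)],
    "mailer":  [("mailer.bin", 20)],
}

# Assign each map the probability of its owning application.
maps = [(p, name, size)
        for app, p in launch_prob.items()
        for name, size in app_maps[app]]
maps.sort(reverse=True)  # highest probability first

budget_mb = 50           # assumed free-memory budget for this cycle
prefetched, used = [], 0
for p, name, size in maps:
    if used + size <= budget_mb:  # prefetch only while budget allows
        prefetched.append(name)
        used += size
print(prefetched)
```

The budget stands in for the system-load and memory statistics mentioned in the text: when memory is tight, fewer maps make the cut in that cycle.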
2.2 Implementation Overhead

Preload runs as a daemon process on the system and has a modest memory footprint [6]. The model, which resides in main memory, consumes less than 3 MB of memory for around a hundred applications. The process is asleep most of the time, waking up periodically or whenever the processor is idle. This ensures that it does not affect the performance of other applications running on the system. Once launched, Preload takes a few cycles to settle into a steady state. After this, it stops making new I/O requests and hence does not interfere with the power-saving schemes used in most modern systems.
2.3 Performance Evaluation

To evaluate its performance, the application launch times obtained with Preload were compared to those obtained when the page cache was cleared (cold cache) and to those when the application was already present in the page cache (warm cache). The cold-cache scheme represents an application launch when a user has not launched the application before, so there are no application-related entries in the page cache. The warm-cache scheme, on the other hand, represents an application launch when a user has previously launched the application. Table 1 shows the time taken for various applications to launch in the three scenarios. It is apparent from the results that Preload is able to reduce application launch times compared to a cold launch; the average reduction with Preload is around 44%. It can also be seen that Preload is more effective at reducing launch times for large applications, making it a good solution for improving application launch times.
Improving Application Launch Times
Application            Cold   Warm   Preload   Gain   Size
OpenOffice.org Writer  15s    2s     7s        53%    90 MB
Firefox Web Browser    11s    2s     5s        55%    38 MB
Evolution Mailer        9s    1s     4s        55%    85 MB
Gedit Text Editor       6s    0.1s   4s        33%    52 MB
Gnome Terminal          4s    0.4s   3s        25%    27 MB

Table 1: Application start-up time with cold and warm caches, and with Preload [6].
3 Hybrid Disks

Figure 2: Hybrid disk logical hierarchy [9].

A hybrid disk (H-HDD) is a traditional HDD combined with embedded flash memory. The embedded flash memory can be arranged either as a new level of the hierarchy between the main memory and the disk (see Fig. 2(a)) or at the same level of the hierarchy as the disk (see Fig. 2(b)).
When used in the configuration shown in Figure 2(a), flash memory can serve as a second-level disk cache [11]. Because flash is nonvolatile, the contents of this disk cache are retained even after the system is rebooted. However, this scheme yields a low hit ratio unless the flash cache is very large [9]. Flash memory in this configuration can also be used as a Write-Only Disk Cache (WODC) [14]. The WODC holds blocks of data that are to be written to the disk; writes complete against the fast flash memory, and the data is then transferred to the HDD asynchronously, resulting in improved HDD performance. However, application launches generate very little write traffic, making a WODC ineffective in this scenario.
When flash memory is used at the same hierarchy level as the disk (see Fig. 2(b)), a small portion of it can be used to pin data. This is referred to as “OEM-pinned data”. Table 2 shows the cache allocation recommended for different flash memory sizes.

Flash size                128 MB   256 MB
H-HDD firmware             10 MB    10 MB
Write cache                32 MB    32 MB
OEM-pinned data            15 MB    79 MB
SuperFetch™ pinned data    71 MB   135 MB

Table 2: Manufacturer recommendation for the flash memory partition in the H-HDD [9].

The OEM-pinned data cache can be used to pin application data in order to improve application launch times. However, due to the size limitation of the OEM-pinned data cache, it is not possible to pin all the data required to launch an application (see Table 3).

Linux Ubuntu 8.04            Windows Vista Ultimate
Evolution 2.22.1   16.9 MB   Excel 2007        15.0 MB
Firefox 3.0b5      27.1 MB   Labview 8.5.1     45.0 MB
F-Spot 0.4.2       27.4 MB   Outlook 2007      16.7 MB
Gimp 2.4.5         15.6 MB   Photoshop CS2     62.4 MB
Rhythmbox 0.11.5   17.9 MB   Powerpoint 2007   14.7 MB
Totem 2.22.1       10.7 MB   Word 2007         27.3 MB

Table 3: Code block size required for application launch [9].

This section looks at a method proposed by Youngjin Joo et al. [9] that improves the application launch time by pinning only a small subset of the application data. The idea is to select an optimal pinned-set for an application, given the size limitation of the OEM-pinned data cache, so that the seek time and rotational latency of the HDD are minimized.
3.1 Pinned-set Selection

The following steps need to be performed to obtain the pinned-set of an application:

1. Determine the application launch sequence from the raw block requests
2. Derive an access cost model of H-HDDs
3. Formulate pinned-set optimization as an ILP problem

Figure 3 shows the framework of the method used to determine the pinned-set selection.
The first step is to extract the application launch sequence for a given application. Software-based disk I/O profiling tools like Blktrace [3] (Linux) and TraceView [12] (Windows) are able to capture the raw block requests issued during the application launch. However, on a typical computer system, other processes may be running as well, and these processes might also request blocks from the disk. Such rogue block requests have no connection to the application launch but are nonetheless captured by the profiling tools.

Figure 3: Framework of the proposed method of pinned-set selection [9].

The application launch sequence extractor is used to clean up the application launch sequence by eliminating these rogue block requests. After a sufficient number of raw block request sequences have been obtained from the disk I/O profiling tool, the extractor performs the following steps to identify and eliminate rogue block requests:

1. Any block requests that access read-write blocks are removed, as only application code blocks are considered for pinning.
2. Block requests which do not occur in all the raw block request sequences are removed.
3. Block requests which do not occur in the same position in all the raw block request sequences are removed.
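A minimal sketch of the extractor's three filtering steps might look as follows. This is our own illustration, not the paper's implementation: it works on abstract block IDs rather than real LBA ranges, and it treats steps 2 and 3 together by keeping only positions on which every filtered trace agrees.

```python
def extract_launch_sequence(traces, read_only):
    """Derive a clean application launch sequence from several raw traces.

    traces    -- list of raw block-request sequences, one per recorded launch
    read_only -- set of block IDs known to be read-only (application code)
    """
    # Step 1: drop requests to read-write blocks; only code blocks are pinnable.
    filtered = [[b for b in t if b in read_only] for t in traces]
    # Steps 2 and 3: keep a request only if the same block appears at the
    # same position in every filtered trace.
    length = min(len(t) for t in filtered)
    return [filtered[0][i] for i in range(length)
            if all(t[i] == filtered[0][i] for t in filtered)]
```

For example, two traces that differ only in an interleaved rogue write request reduce to the same common sequence.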
Once the clean application launch sequence has been obtained, the access cost matrix can be built from the launch sequence together with the H-HDD performance specification. Youngjin Joo et al. also proposed an ILP formulation [9] that, given an application launch sequence and access cost matrix, selects the pinned-set.
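The full formulation in [9] is an ILP, which also captures how pinning one block changes the seek path between its neighbours. As a rough illustration of the objective only, the greedy knapsack-style sketch below (our own simplification, not the paper's method) pins the blocks with the highest avoided access cost per byte until the OEM-pinned-data partition is full.

```python
def select_pinned_set(blocks, capacity):
    """Greedy stand-in for the ILP.

    blocks   -- list of (block_id, size_bytes, saved_cost) tuples, where
                saved_cost is the seek + rotational latency avoided when the
                block is served from flash (from the access cost model)
    capacity -- size of the pinned-set partition in bytes
    """
    pinned, used = [], 0
    # Pin blocks in decreasing order of cost saved per byte of flash.
    for bid, size, saved in sorted(blocks, key=lambda b: b[2] / b[1], reverse=True):
        if used + size <= capacity:
            pinned.append(bid)
            used += size
    return pinned
```

Because the greedy heuristic ignores inter-block seek interactions, it only approximates the ILP optimum; the ILP is what makes the computation times in Section 3.2 significant.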
3.2 Implementation Overhead

Generating a clean application launch sequence can take up to 0.6 seconds, while computing the access cost matrix takes up to 1.5 seconds. However, the time taken to solve the ILP problem dominates the computation time. The time required to solve the ILP is proportional to the size of the application launch sequence; i.e., the larger the application launch sequence, the more time it takes to solve the ILP. Figure 4 shows how the computation time increases with the size of the application launch sequence. This, however, seems to be an acceptable tradeoff, as the computation does not have to be repeated once the pinned-set has been obtained. Over the course of time, though, the application data may change, or the blocks of an application might be relocated during disk optimization, making the current pinned-set ineffective and forcing a re-computation.

Figure 4: Computation times required to solve the ILP problem (pinned-set size: 10% of the application launch sequence size) [9].

The time taken to compute the ILP solution can be reduced, but only by compromising the quality of the solution. For example, a solution within 0.01% of the theoretical bound can be obtained in 65 seconds, but this can be reduced to 26 seconds by accepting an error of 0.2% [9].
3.3 Performance Evaluation

To evaluate the performance of their proposed pinning method, Youngjin Joo et al. compared it with the following two pinning approaches [9].

3.3.1 First-Come First-Pinned

The first-come first-pinned (FCFP) policy pins the blocks in the order in which they appear in the application launch sequence. Blocks are pinned until they fill the pinned-set partition of the flash memory. When an application is launched, all of the initial block requests are then serviced by the flash memory, eliminating disk seek times and rotational latencies during this phase. As a result, the total H-HDD access time is reduced, and this reduction is proportional to the size of the pinned data set.
3.3.2 Small-Chunks-First

Disk seek time and rotational latency are independent of the block size; i.e., whether the requested block is large or small, the delays caused by disk latencies are nearly the same. The small-chunks-first (SCF) policy fills the pinned-set partition of the flash memory by pinning the smallest blocks first, thereby maximizing the number of blocks stored in flash memory. This in turn reduces the number of block requests that are sent to the disk and hence avoids the delays caused by disk seek time and rotational latency.

Figure 5: Values of thdd for various sizes of pinned-set. The x-axes are normalized to the size of the application launch sequence [9].
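Assuming blocks are given as (id, size) pairs in launch order, the two baseline policies can be sketched as follows (our own illustration, not the paper's code):

```python
def fcfp(seq, capacity):
    """First-come first-pinned: pin blocks in launch-sequence order
    until the pinned-set partition is full."""
    pinned, used = [], 0
    for bid, size in seq:
        if used + size > capacity:
            break  # partition full; remaining requests go to the disk
        pinned.append(bid)
        used += size
    return pinned

def scf(seq, capacity):
    """Small-chunks-first: pin the smallest blocks first to maximize the
    number of requests served from flash."""
    pinned, used = [], 0
    for bid, size in sorted(seq, key=lambda b: b[1]):
        if used + size > capacity:
            continue  # this block does not fit, but a smaller one might
        pinned.append(bid)
        used += size
    return pinned
```

For the same capacity, SCF never pins fewer blocks than FCFP, which is exactly why it avoids more per-request disk latencies.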
To evaluate these approaches, ten raw block request sequences for each benchmark application were captured and used as input to the application launch sequence extractor. The resulting clean application launch sequence was then used to calculate the access cost matrix for each application. This was then used with the ILP solver to obtain the pinned-set for different sizes of flash memory.

Figure 5 shows the H-HDD access time (thdd) for various pinned-set sizes for Evolution, Firefox, Photoshop and Powerpoint. The shaded area represents the region where Youngjin Joo et al. consider it beneficial to increase the pinned-set size while using their proposed method. The optimal pinned-set size for applications running on Microsoft Windows is around 30% of the application launch sequence, and that for applications running on Linux is around 20%. This suggests that relatively small pinned-sets are effective with their proposed method.
Table 4 shows the results of the experiment when 10% of the application data was pinned to the flash memory. It also shows the improvement in the application launch time (tlaunch) and the H-HDD access time (thdd) for the different pinning approaches. The proposed method is able to reduce the H-HDD access time by 34% when 10% of the application data is pinned, and this improvement in H-HDD performance translates into a reduction of 24% in the average application launch time [9].

Application   No pinning (sec)    FCFP             SCF              Proposed
              thdd    tlaunch     thdd    tlaunch  thdd    tlaunch  thdd    tlaunch
Evolution      5.70    7.26       93.1%   94.6%    77.7%   82.5%    59.4%   68.1%
Firefox        6.82    8.23       89.8%   91.6%    65.3%   71.3%    53.8%   61.7%
Photoshop     17.36   30.78       89.7%   94.2%    78.1%   87.7%    71.6%   84.0%
Powerpoint     7.25   12.95       95.3%   97.4%    84.9%   91.6%    80.1%   88.8%

Table 4: thdd and tlaunch for a pinned-set of 10% of the application launch sequence size [9].

4 Solid State Drives

A solid state drive (SSD) is made up of a number of NAND flash memory modules and has no mechanical parts, thereby eliminating the disk seek time and rotational latency observed in traditional HDDs. A reasonable solution for improving application launch times would therefore be to replace a traditional HDD with an SSD. But with growing application sizes, it is only a matter of time before even SSDs appear slow. This section looks at how application launch times can be further improved on SSDs by using the Fast Application STarter (FAST) application prefetching method proposed by Youngjin Joo et al. [10].
Many of the optimization techniques used with traditional HDDs cannot be used with SSDs. For example, defragmenting an SSD to improve its performance makes no sense, as the physical location of data does not affect access latency; employing such a technique would only shorten the life of the SSD. In fact, when a modern operating system detects an SSD, it disables the optimization techniques used for traditional HDDs. For example, when Windows 7 detects that an SSD is being used, it disables disk defragmentation, Superfetch, and Readyboost [13].
4.1 FAST
Figure 6(a) shows how a typical application launch is handled. Here, si is the i-th block request generated during the launch and n is the total number of blocks requested. After a block is fetched, the CPU can proceed with the launch process (ci) until another page miss occurs. This cycle is repeated until the application is launched.
Let the time spent for si and ci be denoted by t(si) and t(ci), respectively. Then the computation (CPU) time, tcpu, is expressed as

    tcpu = Σ_{i=1}^{n} t(ci)                  (1)

and the SSD access (I/O) time, tssd, is expressed as

    tssd = Σ_{i=1}^{n} t(si)                  (2)

Then the application launch time can be expressed as

    tlaunch = tssd + tcpu                     (3)
Figure 6: Various application launch scenarios (n = 4) [10].
The main idea of FAST is to overlap the I/O with the CPU computation so as to hide tssd. This is achieved by running an application prefetcher concurrently with the application: the prefetcher fetches the application launch sequence (s1, ..., sn) while the application itself is being launched (tcpu).
One possible scenario for FAST is when the computation time is larger than the SSD access time (tcpu > tssd). This is illustrated in Figure 6(b). At time t = 0, the application and the prefetcher are started simultaneously. They compete with one another for access to the SSD; however, since both request the same block s1, it does not matter which of them is granted the bus first. After s1 has been fetched, the application can start with the launch (c1) while the prefetcher continues to fetch the subsequent blocks. When the application requests the next block, it is already present in memory, so there is no page miss. Hence, the resulting application launch time (tlaunch) becomes

    tlaunch = t(s1) + tcpu                    (4)

Another possible scenario is when the computation time is smaller than the SSD access time (tcpu < tssd). This is illustrated in Figure 6(c). Here, the prefetcher is not able to fetch the entire block s2 before the application requests it. This is nevertheless still faster than the scenario in Figure 6(a), and the improvement accumulates over the remaining block requests, resulting in a launch time of

    tlaunch = tssd + t(cn)                    (5)

However, n ranges up to a few thousand for typical applications, and thus t(s1) ≪ tcpu and t(cn) ≪ tssd [10]. Consequently, Eqs. (4) and (5) can be combined into a single equation:

    tlaunch ≈ max(tssd, tcpu)                 (6)
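The behaviour captured by Eqs. (4)-(6) can be checked with a small simulation. This is an illustrative sketch with function names of our own choosing, not code from FAST: the prefetcher streams s1..sn back to back, while each computation step ci starts as soon as both c(i-1) has finished and si is in memory.

```python
def fast_launch_time(t_s, t_c):
    """Simulated launch time with FAST running alongside the application.

    t_s -- list of I/O times t(s_i); t_c -- list of CPU times t(c_i)
    """
    io_done = 0.0   # time at which block s_i is in memory
    cpu_done = 0.0  # time at which computation c_i finishes
    for ts, tc in zip(t_s, t_c):
        io_done += ts                           # the prefetcher never idles
        cpu_done = max(cpu_done, io_done) + tc  # c_i may have to wait for s_i
    return cpu_done

def cold_launch_time(t_s, t_c):
    """Without FAST, I/O and computation strictly alternate (Fig. 6(a)):
    tlaunch = tssd + tcpu, as in Eq. (3)."""
    return sum(t_s) + sum(t_c)
```

With t(si) = 1 and t(ci) = 3 for n = 4 (CPU-bound), the simulation returns 13 = t(s1) + tcpu, matching Eq. (4); with the values swapped (I/O-bound) it returns 13 = tssd + t(cn), matching Eq. (5); the cold launch takes 16 in both cases.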
4.2 Implementation

Figure 7: The proposed application prefetching [10].
The processes of FAST can be divided into two broad categories, depending on whether they run during the application launch or as an idle process. Figure 7 shows the different components of FAST and how they interact with one another.

Blktrace [3], a disk I/O profiler, is used to record the raw block request sequence issued during the application launch; the device number, LBA, I/O size and completion time are also recorded. However, the operating system or some other process might also access the disk during the application launch, so the raw block request sequence captured by Blktrace varies from one launch to another. The application launch sequence extractor cleans up the raw block request sequence by collecting two or more raw block request sequences and then extracting a common sequence. This common sequence is known as the application launch sequence.
A block can be represented as a file and an offset within that file. The application prefetcher can request a specific block (LBA) by issuing a system call with the file name and offset. However, finding the file name and offset for a given LBA is not supported by most file systems. To find this mapping, a system-call profiler (strace) is used to obtain a complete list of the files that were accessed during the application launch. The LBA-to-inode reverse mapper is then used to create an LBA-to-inode map from these files; it uses a red-black tree to reduce the search time of the LBA-to-inode map.
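The reverse mapping can be sketched as follows. The paper's mapper uses a red-black tree; the sketch below, with class and method names of our own invention, substitutes a sorted list with binary search, which likewise gives O(log n) lookups over file extents.

```python
import bisect

class LbaToInodeMap:
    """Map an LBA to the (file path, file offset) that backs it."""

    def __init__(self):
        self._starts = []   # extent start LBAs, kept sorted
        self._extents = []  # (start_lba, length, path, file_offset), same order

    def add_extent(self, start_lba, length, path, file_offset):
        """Register one contiguous on-disk extent of a file."""
        i = bisect.bisect_left(self._starts, start_lba)
        self._starts.insert(i, start_lba)
        self._extents.insert(i, (start_lba, length, path, file_offset))

    def lookup(self, lba):
        """Return (path, offset) for this LBA, or None if unmapped."""
        i = bisect.bisect_right(self._starts, lba) - 1
        if i < 0:
            return None
        start, length, path, off = self._extents[i]
        if lba < start + length:
            return path, off + (lba - start)
        return None
```

Lookups find the last extent starting at or before the LBA and check that the LBA falls inside it, exactly the query pattern a red-black tree of extents would serve.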
The application prefetcher is a user-level program that replays the disk access requests made by a target application [10]. The application prefetcher generator automatically creates an application prefetcher for each target application. It performs the following operations:

1. Read si one by one from the application launch sequence of the target application.
2. Convert si into its associated data items stored in the LBA-to-inode map.
3. Depending on the type of block, generate an appropriate system call using the converted disk access information.
4. Repeat Steps 1–3 until all si have been processed.

Once the application prefetcher for an application has been created, it is invoked by the application launch manager whenever the application is launched.

Running processes                            Runtime (sec)
1. Application only (cold start scenario)    0.86
2. strace + blktrace + application           1.21
3. blktrace + application                    0.88
4. Prefetcher generation                     5.01
5. Prefetcher + application                  0.56
6. Prefetcher + blktrace + application       0.59
7. Miss ratio calculation                    0.90

Table 5: Runtime overhead (application: Firefox) [10].
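A prefetcher generated by the steps above essentially issues one prefetch request per launch-sequence entry. The sketch below is our own illustration, not FAST's generated code: it uses Linux's `posix_fadvise(POSIX_FADV_WILLNEED)` as a stand-in for the readahead-style system calls the real prefetcher issues, and it assumes a `lba_to_inode` callback that returns a (path, offset, length) triple or None.

```python
import os

def prefetch(launch_sequence, lba_to_inode):
    """Replay an application's launch sequence: for each block, advise the
    kernel to read the backing file region into the page cache ahead of use."""
    fds = {}
    try:
        for lba in launch_sequence:
            entry = lba_to_inode(lba)
            if entry is None:
                continue  # block not backed by a known file; skip it
            path, offset, length = entry
            if path not in fds:
                fds[path] = os.open(path, os.O_RDONLY)
            # POSIX_FADV_WILLNEED asks the kernel to start reading the
            # region asynchronously, so the data is cached before the
            # application's own page miss would have fetched it.
            os.posix_fadvise(fds[path], offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        for fd in fds.values():
            os.close(fd)
```

Run concurrently with the application (as in Figure 6(b)/(c)), such a loop turns the application's synchronous page misses into cache hits.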
4.3 Implementation Overhead

Table 5 shows the runtime overhead of FAST for Firefox. Case 2 is run only once. Case 3 runs once for each raw block request sequence that is captured. However, Cases 2 and 3 are run only when no application prefetcher is found for the application. The application prefetcher is generated in Case 4, which has the highest runtime; this, however, can be hidden from the user by running it in the background. Cases 5–7 are part of the application prefetcher and are repeated until the application prefetcher is invalidated. Case 7 can also be run in the background, effectively hiding it from the user.
FAST also creates some temporary files, but they can be deleted once the application prefetcher has been created. However, the actual application prefetchers and application launch sequences do occupy disk space. In the experiments performed by Youngjin Joo et al., the total size of the application prefetchers and application launch sequences for all 22 applications was 7.2 MB [10].
4.4 Performance Evaluation

To evaluate the performance of FAST, Youngjin Joo et al. compared it with the following scenarios [10].

• Cold start: The application is launched immediately after flushing the page cache. The resulting launch time is denoted by tcold.

• Warm start: First, only the application prefetcher is run, so that all the application launch sequence blocks are loaded into the page cache. The application is then run immediately afterwards. The resulting launch time is denoted by twarm.

• Sorted prefetch: The application prefetcher was modified to fetch the block requests in the application launch sequence in the sorted order of their LBAs. After flushing the page cache, the modified application prefetcher was run and the application was then immediately launched. The resulting launch time is denoted by tsorted.

• FAST: The application was run simultaneously with the application prefetcher after flushing the page cache. The resulting launch time is denoted by tFAST.

• Prefetcher only: The application prefetcher is run after the page cache is flushed. The completion time of the application prefetcher is denoted by tssd and is used to calculate a lower bound on the application launch time, tbound = max(tssd, tcpu), where tcpu = twarm is assumed.

Figure 8: Measured application launch time (normalized to tcold) [10].
Launch times were recorded for all of the above scenarios. Figure 8 shows the results, normalized to tcold. FAST achieved an average reduction of 28% in launch time compared to the cold start scenario, while the HDD-aware sorted prefetch showed only a 7% reduction. FAST was able to achieve this with no additional overhead, demonstrating the need for, and the utility of, a new SSD-aware optimizer [10].
5 HDDs, H-HDDs & SSDs

When HDDs made their first appearance, they were expensive. With advancements in technology and their ever-growing demand, however, they have become affordable, with costs as low as $0.16 per GB. SSDs today are all about performance, with sequential read speeds of up to 270 megabytes per second (MB/s). However, they are relatively expensive, with an average cost of $2.15 per GB, nearly thirteen times that of traditional HDDs; the improved performance does come at a high price. With time, SSDs might follow the trend seen in HDDs and eventually become affordable; but for the time being, is there something that can match the performance of an SSD at the price of an HDD? The answer is yes: H-HDDs bridge this gap by embedding flash memory into a traditional HDD. They perform nearly three times better than traditional HDDs [1] and, at a cost of $0.33 per GB, are nearly one sixth the cost of SSDs. Table 6 compares the prices of HDDs, H-HDDs and SSDs of various capacities. From the looks of it, H-HDDs give you plenty of bang for the buck and are here to stay.

Capacity   HDD (Seagate Momentus)   H-HDD (Seagate Momentus XT)   SSD (Intel 320 Series)
750 GB     $120                     $245 (8 GB flash)             -
600 GB     -                        -                             $1260
500 GB     $80                      $150 (4 GB flash)             -
320 GB     $130                     $125 (4 GB flash)             -
300 GB     -                        -                             $630
250 GB     $90                      $140 (4 GB flash)             -
160 GB     $160                     -                             $340
120 GB     $80                      -                             $260
80 GB      $65                      -                             $200
40 GB      $45                      -                             $110

Table 6: Prices for 2.5" drives [2, 5].

Approach          HDD   H-HDD   SSD   Smartphone
Preload           ✓     ✗       ✗     ✗
OEM-pinned data   ✗     ✓       ✗     ✗
FAST              ✗     ✗       ✓     ✓

Table 7: Approaches and supported devices
6 Conclusion

This paper looked at three approaches that can be used to improve application launch times. Table 7 shows the approaches and the devices they can be used with. Preload uses prefetching to improve application launch times: it tries to predict when an application might be launched and then preloads it into main memory. When the application is eventually launched, its launch data is already present in main memory, resulting in a faster launch. The paper also looked at how the OEM-pinned data cache of an H-HDD can be used effectively to reduce the average application launch time. With this approach, the average launch time could be reduced by 24% by pinning only 10% of the application launch sequence. Finally, the paper looked at FAST, an optimization technique that can be applied to already fast SSDs; using FAST, application launch times on SSDs could be reduced by a further 28%. FAST has excellent portability [10], and it would be interesting to see how it could be used with state-of-the-art devices like smartphones or tablets.
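To illustrate the prediction step at the heart of a Preload-style daemon, the sketch below records which applications have historically been launched after one another and uses the most frequent successor as the preload candidate. This is a deliberately simplified, hypothetical model; the actual daemon described in [6] uses a more elaborate statistical approach:

```python
from collections import defaultdict

class LaunchPredictor:
    """Toy follow-on predictor: after app A is launched, suggest the app
    that has most often followed A in past sessions, so it can be
    preloaded into main memory before the user starts it."""

    def __init__(self):
        # follows[a][b] counts how often b was launched directly after a.
        self.follows = defaultdict(lambda: defaultdict(int))
        self.last_app = None

    def record_launch(self, app):
        # Update the successor counts with the observed launch.
        if self.last_app is not None:
            self.follows[self.last_app][app] += 1
        self.last_app = app

    def predict_next(self, app):
        # Return the most frequent successor of `app`, or None if unknown.
        successors = self.follows.get(app)
        if not successors:
            return None
        return max(successors, key=successors.get)

# Hypothetical launch history: "mail" usually follows "browser".
predictor = LaunchPredictor()
for app in ["browser", "mail", "browser", "mail", "browser", "editor"]:
    predictor.record_launch(app)

print(predictor.predict_next("browser"))  # "mail" would be preloaded next
```

A real prefetching daemon would additionally weigh free memory, time of day, and prediction confidence before spending I/O bandwidth on a preload.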
Gavin Vaz<br />
References

[1] http://www.seagate.com/www/en-us/products/laptops/laptop-hdd/. [Online; accessed 30-November-2011].

[2] http://www.amazon.com. [Online; accessed 30-November-2011].

[3] Jens Axboe. Block IO tracing. https://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README, September 2006. [Online; accessed 26-November-2011].

[4] Microsoft Corporation. Windows PC accelerators. http://www.microsoft.com/whdc/system/sysperf/perfaccel.mspx, October 2010. [Online; accessed 25-November-2011].

[5] Nathan Edwards. Seagate Momentus XT 750GB review. http://www.maximumpc.com/article/reviews/seagate_momentus_xt_750gb_review, November 2011. [Online; accessed 30-November-2011].

[6] Behdad Esfahbod. Preload - an adaptive prefetching daemon. Master's thesis, University of Toronto, 2006.

[7] Darin Fisher and Gagan Saksena. Link prefetching in Mozilla: A server-driven approach. In Fred Douglis and Brian Davison, editors, Web Content Caching and Distribution, pages 283–291. Springer Netherlands, 2004.

[8] Apple Computer Inc. Launch time performance guidelines. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/LaunchTime/LaunchTime.html, April 2006. [Online; accessed 25-November-2011].

[9] Yongsoo Joo, Youngjin Cho, Kyungsoo Lee, and Naehyuck Chang. Improving application launch times with hybrid disks. In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '09, pages 373–382, New York, NY, USA, 2009. ACM.

[10] Yongsoo Joo, Junhee Ryu, Sangsoo Park, and Kang G. Shin. FAST: Quick application launch on solid-state drives. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST '11, Berkeley, CA, USA, 2011. USENIX Association.

[11] B. Marsh, F. Douglis, and P. Krishnan. Flash memory file caching for mobile computers. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, volume 1, pages 451–460, January 1994.
[12] Microsoft. Windows Driver Kit. http://msdn.microsoft.com/en-us/library/ff553872.aspx, September 2011. [Online; accessed 26-November-2011].

[13] Steven Sinofsky. Support and Q&A for solid-state drives. https://blogs.msdn.com/b/e7/archive/2009/05/05/support-and-q-a-for-solid-state-drives-and.aspx, May 2009. [Online; accessed 28-November-2011].

[14] Jon A. Solworth and Cyril U. Orji. Write-only disk caches. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, pages 123–132, New York, NY, USA, 1990. ACM.

[15] Steven P. Vanderwiel and David J. Lilja. Data prefetch mechanisms. ACM Computing Surveys, 32:174–199, June 2000.