

between dedicated HW accelerators and CPUs is performed. The paper shows examples of the performance estimation using traces and discusses the origins of the errors produced by the proposed method.

2 Related Work

Performance evaluation of SoC architectures using high-level models has been addressed in many research works. One possible solution for obtaining accurate timing information at the system level is the integration of instruction set simulators into high-level SystemC models. MPARM [2] is a simulation platform for MPSoC architectures in which a cycle-accurate model of an ARM processor is integrated into the SystemC environment. In [4], an ISS is co-simulated with SystemC using the GDB kernel; in this method, the high-level model receives timing information via commands sent through the debugging interfaces. However, the integration of instruction set simulators significantly decreases simulation performance.
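As a rough illustration of this first principle, the following sketch wraps a hypothetical instruction-set simulator in a SystemC thread that advances simulation time by the cycles reported for each executed instruction. The Iss class, its step()/done() interface, and the 10 ns clock period are illustrative assumptions only and do not correspond to the simulators used in [2] or [4].

    // Minimal sketch of ISS/SystemC coupling (assumed ISS interface, not from [2] or [4]).
    #include <systemc>
    #include <iostream>

    using namespace sc_core;

    // Placeholder for a real instruction-set simulator.
    struct Iss {
        unsigned pc = 0;
        bool done() const { return pc >= 100; }   // hypothetical termination condition
        unsigned step() { ++pc; return 1; }       // execute one instruction, report its cycle count
    };

    SC_MODULE(IssWrapper) {
        Iss iss;
        sc_time clock_period;

        SC_CTOR(IssWrapper) : clock_period(10, SC_NS) {
            SC_THREAD(run);
        }

        void run() {
            // Advance SystemC time by the cycles the ISS reports for each instruction.
            while (!iss.done())
                wait(iss.step() * clock_period);
            std::cout << "ISS finished at " << sc_time_stamp() << std::endl;
        }
    };

    int sc_main(int, char*[]) {
        IssWrapper wrapper("iss_wrapper");
        sc_start();
        return 0;
    }

Even this toy coupling hints at why the approach is slow: the SystemC kernel is re-entered for every simulated instruction.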

Along with HW/SW co-simulation using an ISS, there is another modeling principle in which precise timing information is back-annotated to abstracted high-level simulations. This method is widely used in software code instrumentation techniques. In [12], the target C code is used to generate SystemC code in which the execution time is partially determined at compile time. In turn, dynamic timing information, e.g., the effects of caches and branch predictors, is obtained at simulation runtime using models of the architecture components. In [6], the target code is instrumented with additional function calls for keeping a count of the executed cycles and for accessing the TLM communication infrastructure.
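The back-annotation principle can be pictured as follows: each basic block of the target code is emitted together with a statically determined cycle cost, while data accesses are routed through a cache model that adds its penalty at simulation runtime. The CacheModel class, the 16-byte line size, and the cycle numbers below are illustrative assumptions, not the instrumentation scheme of [12] or [6].

    // Sketch of timing back-annotation via code instrumentation (illustrative only).
    #include <systemc>
    #include <iostream>

    using namespace sc_core;

    // Toy cache model: reports a 20-cycle penalty on the first access to a 16-byte line
    // (assumed line size and miss penalty).
    struct CacheModel {
        unsigned last_line = ~0u;
        unsigned access(unsigned addr) {
            unsigned line = addr / 16;
            if (line == last_line) return 0;   // hit: no extra delay
            last_line = line;
            return 20;                         // miss: assumed penalty in cycles
        }
    };

    SC_MODULE(AnnotatedSw) {
        CacheModel cache;
        sc_time cycle;

        SC_CTOR(AnnotatedSw) : cycle(10, SC_NS) { SC_THREAD(run); }

        // Instrumented version of the target C loop:  for (i = 0; i < 4; ++i) sum += buf[i];
        void run() {
            unsigned buf[4] = {1, 2, 3, 4}, sum = 0;
            for (unsigned i = 0; i < 4; ++i) {
                sum += buf[i];
                wait(3 * cycle);                             // statically annotated cost of the block
                wait(cache.access(0x1000 + i * 4) * cycle);  // dynamic penalty from the cache model
            }
            std::cout << "sum=" << sum << ", finished at " << sc_time_stamp() << std::endl;
        }
    };

    int sc_main(int, char*[]) {
        AnnotatedSw sw("annotated_sw");
        sc_start();
        return 0;
    }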

In [7], the authors show how traces can be employed to generate communication analysis graphs, which are later used to evaluate on-chip communication architectures. In the Sesame [10] and Spade [8] frameworks, traces are used to represent the workload of multimedia application models constructed using Kahn Process Networks (KPN). However, in these approaches the range of target applications is restricted to those that are adaptable to the KPN model of computation. Moreover, the processing latencies and communication transactions specified by these traces are very coarse-grained, i.e., delay entries represent large blocks of computation. At this granularity level, the exact memory traffic, which can significantly influence the application performance, is not addressed. For example, contention of multiple CPUs on the shared on-chip interconnect during cache misses within the computation of a KPN process cannot be considered at this abstraction level. In our method, traces specify precise patterns of memory accesses and processing latencies, allowing for a more accurate analysis of the application workload.
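To make the granularity difference concrete, the fragment below defines an assumed instruction-level trace format in which processing latencies alternate with individual read and write accesses, and replays it against a fixed-latency placeholder for the interconnect. The entry layout and the bus_access() latency are illustrative and are not the trace format of this paper or of the Sesame and Spade frameworks.

    // Sketch of an instruction-level trace with explicit memory accesses (assumed format).
    #include <cstdint>
    #include <iostream>
    #include <vector>

    enum class Op { Compute, Read, Write };

    struct TraceEntry {
        Op       op;
        uint32_t value;   // Compute: processing latency in cycles; Read/Write: target address
    };

    // Placeholder interconnect latency; a real model would arbitrate between CPUs.
    uint32_t bus_access(uint32_t /*addr*/) { return 12; }

    // Replay the trace and accumulate the cycle count it implies.
    uint64_t replay(const std::vector<TraceEntry>& trace) {
        uint64_t cycles = 0;
        for (const TraceEntry& e : trace) {
            if (e.op == Op::Compute)
                cycles += e.value;              // back-annotated processing latency
            else
                cycles += bus_access(e.value);  // per-access memory traffic, visible to the interconnect
        }
        return cycles;
    }

    int main() {
        std::vector<TraceEntry> trace = {
            {Op::Compute, 5}, {Op::Read, 0x1000}, {Op::Compute, 3}, {Op::Write, 0x2000},
        };
        std::cout << "estimated cycles: " << replay(trace) << std::endl;  // 5 + 12 + 3 + 12 = 32
        return 0;
    }

Because every memory access appears as its own entry, a shared-bus model can arbitrate between the accesses of several CPUs, which is exactly what the coarse-grained KPN traces cannot express.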

S. Mahadevan et al. demonstrated a technique in which traces, obtained from a cycle-accurate CPU simulator, are used to represent the component's behavior exactly as captured at its bus interface [9]. In contrast, in our method we define the traces at the instruction level and explicitly model the cache component, which
