Architecture of Computing Systems (Lecture Notes in Computer ...


236 F. Xudong et al.

compared to the naive GPU version and obtains a speedup of up to 15.06x over the CPU implementation running on an Intel Xeon E5405 CPU.

The remainder of this paper is organized as follows. Section 2 gives background information on programming with Brook+ on the AMD Radeon HD4870. Section 3 presents our optimization strategies. Section 4 evaluates the effectiveness of these strategies. Section 5 discusses related work, and the final section states our conclusions.

2 Background

Using GPUs and general-purpose CPUs to construct heterogeneous parallel systems has attracted much interest in the field of high performance computing (HPC) [5]. GPUs’ powerful floating-point capacity and high performance-per-watt qualify them as good accelerators for speeding up CPU applications, delivering high performance at a relatively small system scale and with low power consumption. In this section, we introduce some background information on programming an AMD GPU platform using Brook+, including the micro architecture of the Radeon HD4870 GPU and the Brook+ stream programming environment.

2.1 Micro Architecture

In this paper, we use an AMD Radeon HD4870 GPU as the accelerator for the CPU. AMD’s HD4800 series (codename RV770) is the newest GPU in its stream computing lineup that supports double-precision floating-point operations. The RV770 core has 10 SIMD engines, each of which contains 16 thread processors. Each thread processor consists of five scalar stream cores, so a total of 800 cores (10 × 16 × 5) are integrated on a single die. All five cores can execute both single-precision floating-point and integer operations, and one of them can also handle transcendental operations such as sin, cos, and log. Notably, a thread processor combines four of its stream cores (excluding the transcendental one) to process double-precision operations. In addition, each thread processor contains a branch execution unit to handle branches. Tab. 1 summarizes the HD4870’s specifications.
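As a rough check on the figures above, the core count and the peak rates listed in Tab. 1 can be reproduced with a few lines of arithmetic. This is a minimal sketch and not taken from the paper: it assumes that each unit completes one multiply-add per clock cycle, counted as 2 floating-point operations, and all function and constant names are purely illustrative.

```python
# Back-of-the-envelope peak throughput for the Radeon HD4870 (RV770).
# Assumption (not stated in the text): every unit completes one
# multiply-add per cycle, counted as 2 floating-point operations.

SIMD_ENGINES = 10
THREAD_PROCESSORS_PER_SIMD = 16
STREAM_CORES_PER_THREAD_PROCESSOR = 5
CORE_CLOCK_HZ = 750e6          # core clock from Tab. 1
FLOPS_PER_MULTIPLY_ADD = 2     # assumed


def total_thread_processors() -> int:
    return SIMD_ENGINES * THREAD_PROCESSORS_PER_SIMD


def total_stream_cores() -> int:
    return total_thread_processors() * STREAM_CORES_PER_THREAD_PROCESSOR


def peak_single_precision_flops() -> float:
    # All five stream cores of a thread processor handle single precision.
    return total_stream_cores() * CORE_CLOCK_HZ * FLOPS_PER_MULTIPLY_ADD


def peak_double_precision_flops() -> float:
    # Four stream cores combine into one double-precision unit, so each
    # thread processor contributes one double-precision multiply-add per cycle.
    return total_thread_processors() * CORE_CLOCK_HZ * FLOPS_PER_MULTIPLY_ADD


if __name__ == "__main__":
    print(total_stream_cores(), "stream cores")
    print(peak_single_precision_flops() / 1e12, "Tflops single precision")
    print(peak_double_precision_flops() / 1e9, "Gflops double precision")
```

Under these assumptions the single-precision and double-precision peaks differ by a factor of five, which matches the ratio of stream cores to double-precision units per thread processor.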

In the AMD stream computing model, a stream denotes a collection of data elements of the same type that can be operated on in parallel. A kernel is a parallel function that operates on each element of an output stream. An instance of kernel execution on a thread processor is called a thread. Threads are mapped to thread processors for execution and scheduled in wavefronts. A

Table 1. AMD’s HD4870 Specification<br />

Thread Processors    800        Memory Clock Speed      993 MHz
Texture Units        40         Memory Interface        256 bits
Core Clock Speed     750 MHz    Memory Bandwidth        115 GB/s
Memory Type          GDDR5      Single (Double) Peak    1.2 T (240 G) flops
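The stream, kernel, and thread terminology of Section 2.1 can be illustrated with a small sequential model. The sketch below is written in Python rather than Brook+ and does not use any AMD API: `run_kernel`, `group_into_wavefronts`, and `add` are hypothetical names, and the `wavefront_size` argument is an illustrative parameter, not the hardware's actual value.

```python
from typing import Callable, List, Sequence

# A stream is modelled as a sequence of same-typed elements.
Stream = Sequence[float]


def run_kernel(kernel: Callable[[float, float], float],
               a: Stream, b: Stream) -> List[float]:
    """Apply `kernel` once per output element; each call models one thread."""
    if len(a) != len(b):
        raise ValueError("input streams must have the same length")
    # On a GPU these instances run in parallel; here they run one by one.
    return [kernel(x, y) for x, y in zip(a, b)]


def group_into_wavefronts(num_threads: int, wavefront_size: int) -> List[range]:
    """Split thread indices into consecutive scheduling groups."""
    return [range(start, min(start + wavefront_size, num_threads))
            for start in range(0, num_threads, wavefront_size)]


def add(x: float, y: float) -> float:
    # Example kernel body: one element-wise sum.
    return x + y


if __name__ == "__main__":
    out = run_kernel(add, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
    print(out)
    print(group_into_wavefronts(len(out), wavefront_size=2))
```

The kernel body sees only the elements for its own output position, which is what allows the instances to be executed independently and grouped for scheduling.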
