Architecture of Computing Systems (Lecture Notes in Computer ...

238 F. Xudong et al.

stencil computations of Mgrid, many intermediate results can be reused. Reusing these results reduces memory fetches as well as computation. Generally, large thread granularity can improve data locality and computation intensity, though it entails the consumption of more GPRs. We will explain the concept of thread granularity later.

Given limited resources (such as the number of GPRs and memory bandwidth) and the amount of work (e.g., a specific kernel computation), the number of threads that can be created is determined by the thread granularity. In other words, the thread granularity is inversely proportional to the number of threads, so tuning it trades locality within a thread against parallelism among threads [8]. To achieve the best performance, programmers should carefully tune the thread granularity to strike the right balance.
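To make the notion concrete, a minimal Brook+ stencil kernel with a granularity of one grid point per thread might look like the following sketch. The kernel and stream names are illustrative, not taken from the paper, and the exact `indexof`/gather forms vary by Brook+ version:

```
// Sketch: one thread computes exactly one grid point,
// so thread granularity = 1 and #threads = domain size.
kernel void resid1(float src[][], out float dst<>) {
    // indexof yields this thread's position in the output domain
    float2 idx = indexof(dst);
    int x = (int)idx.x;
    int y = (int)idx.y;
    // 5-point stencil read from the gather stream src
    dst = 0.25f * (src[y][x - 1] + src[y][x + 1] +
                   src[y - 1][x] + src[y + 1][x]) - src[y][x];
}
```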

In Brook+, the total number of threads is determined by the output stream size (domain size). Note that the thread granularity here means the number of grid points a thread calculates. There are two methods for tuning thread granularity in Brook+:

(a) Using vector types

Brook+ provides built-in short vector types for tuning the code explicitly on the available short-SIMD machines. Short vectors are named after their base type, with the size appended as a suffix, such as float4 and double2. Using vector types reduces the domain size (the output stream length) by a factor of the vector size, and consequently increases the thread granularity by the same factor. Take double2 for example: using it increases the thread granularity by a factor of two. A thread can now compute two stencil points at a time, so more data reuse can be exploited through using the intermediate results within each thread.
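A hedged sketch of the double2 variant, using a 1D 3-point stencil for brevity (kernel and stream names are illustrative): each thread now writes one double2, i.e., two grid points, and the fetched centre element serves both results.

```
// Sketch: each double2 element packs two adjacent grid points,
// halving the domain size and doubling thread granularity.
kernel void resid2(double2 src[], out double2 dst<>) {
    int i = (int)indexof(dst);
    double2 c = src[i];          // the two points this thread computes
    double  l = src[i - 1].y;    // left neighbour of c.x
    double  r = src[i + 1].x;    // right neighbour of c.y
    // 3-point stencil; the single fetch of c is reused:
    // c.y is a neighbour of c.x and vice versa.
    dst.x = 0.5 * (l + c.y) - c.x;
    dst.y = 0.5 * (c.x + r) - c.y;
}
```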

Moreover, vector types can pack up to four scalar fetches into a single vector fetch. Vector fetches significantly reduce memory fetches if a kernel is designed to fetch from consecutive data locations, thus making more efficient use of the fetch resources. For example, a kernel can issue one float4 fetch in one cycle versus four separate float fetches in four cycles. In the stencil computations of Resid, where the locations of the data to be fetched are usually consecutive, vector fetches naturally raise arithmetic intensity and improve memory performance.
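The fetch packing can be illustrated by contrasting the scalar and vector forms of the same consecutive reads. This is a sketch under assumed names, not code from the paper:

```
// Scalar form: four separate fetches from consecutive locations.
kernel void scale1(float src[], out float4 dst<>) {
    int i = (int)indexof(dst);
    float a = src[4 * i];        // four fetch requests...
    float b = src[4 * i + 1];
    float c = src[4 * i + 2];
    float d = src[4 * i + 3];
    dst = float4(a, b, c, d) * 2.0f;
}

// Vector form: the same four values in one float4 fetch.
kernel void scale4(float4 src[], out float4 dst<>) {
    int i = (int)indexof(dst);
    dst = src[i] * 2.0f;         // ...become a single fetch request
}
```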

(b) Multiple output streams

Brook+ supports up to eight simultaneous output streams per kernel using the CAL backend [1]. With multiple output streams, a thread completes a multiple of the computation it would perform with a single output stream, which also increases the thread granularity. For example, using two output streams doubles the thread granularity. Combining vector types with multiple output streams attains even larger thread granularity. For instance, using double2 and four output streams together increases the thread granularity by a factor of eight (2×4=8).
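A sketch of the combined case (illustrative names; the mapping of output streams to grid sections is our assumption, as the paper does not spell it out): with double2 elements and four output streams, each thread produces 4 × double2 = 8 grid points.

```
// Sketch: granularity = 2 (double2) x 4 (output streams) = 8 points.
// Shown as a plain copy to keep the granularity bookkeeping visible.
kernel void grain8(double2 src[],
                   out double2 o0<>, out double2 o1<>,
                   out double2 o2<>, out double2 o3<>) {
    int i = (int)indexof(o0);
    // One thread covers four consecutive double2 elements of src,
    // writing one element to each of the four output streams.
    o0 = src[4 * i];
    o1 = src[4 * i + 1];
    o2 = src[4 * i + 2];
    o3 = src[4 * i + 3];
}
```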
