20.11.2014 Views

PPKE ITK PhD and MPhil Thesis Classes


2. MAPPING THE NUMERICAL SIMULATIONS OF PARTIAL DIFFERENTIAL EQUATIONS

2.3.2.5 Implementation on the Cell Architecture

Using the discretization method described previously, a C-based CFD solver was developed and optimized for the SPEs of the Cell architecture.

The simplest way to define a body-fitted mesh is to store the coordinates of the vertices. From these coordinates, the length and the normal vector of the faces and the volume of each quadrilateral can be computed. This solution requires the lowest memory bandwidth; however, it is not efficient, because the SPE has no hardwired floating-point divider or square root unit. Computing these values on the SPE requires as many instructions as computing the derivative and updating the state value of one cell. Additionally, these values are constant during the computation and can be computed in advance by the PPU. Eight constant variables are required to describe the quadrilaterals, namely the lengths of the interfaces, the normal vector components, the volume, and the mask. Although using these precomputed values increases the memory I/O bandwidth requirement of the algorithm by 45%, the number of required instructions is nearly halved; therefore, the computing performance is doubled.

Since the relatively small local memory of the SPEs cannot hold all the required data, an efficient buffering method is necessary to save memory bandwidth. In our solution, a belt of 5 rows of the array is stored in the local memory: 2 rows are required to form the local neighborhood of the currently processed row, one row is required for data synchronization, and two rows allow the computation and the communication to overlap, as shown in Figure 2.13.
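The row belt described above can be sketched as a small ring buffer. This is only an illustration under stated assumptions: `fetch_row` is a hypothetical stand-in that uses `memcpy` where the SPE would issue an asynchronous DMA transfer, the three-row sum is a placeholder for the real cell update, and the assignment of slots to roles is simplified compared with Figure 2.13.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BELT_ROWS 5     /* belt height stated in the text */
#define MAX_WIDTH 1000  /* maximum row length stated in the text */

/* Hypothetical stand-in for a DMA transfer from main memory to local store. */
static void fetch_row(float *slot, const float *grid, int width, int row)
{
    memcpy(slot, grid + (size_t)row * width, (size_t)width * sizeof(float));
}

/* Streams the grid through the belt. For every interior row the output is
 * the sum of the row above, the row itself and the row below, which stands
 * in for the real stencil. Boundary rows of out are left untouched. */
static void stream_rows(const float *grid, float *out, int width, int height)
{
    static float belt[BELT_ROWS][MAX_WIDTH];
    assert(width > 0 && width <= MAX_WIDTH && height >= 3);

    /* Preload the first two rows. */
    fetch_row(belt[0], grid, width, 0);
    fetch_row(belt[1], grid, width, 1);

    for (int r = 1; r < height - 1; ++r) {
        /* Row r + 1 overwrites the slot that held row r - 4, which is no
         * longer needed. On the SPE this fetch would be started earlier and
         * run while previous rows are processed; the two spare slots are
         * what would hold such in-flight transfers. */
        fetch_row(belt[(r + 1) % BELT_ROWS], grid, width, r + 1);

        const float *up   = belt[(r - 1) % BELT_ROWS];
        const float *mid  = belt[r % BELT_ROWS];
        const float *down = belt[(r + 1) % BELT_ROWS];
        for (int c = 0; c < width; ++c)
            out[(size_t)r * width + c] = up[c] + mid[c] + down[c];
    }
}
```

Indexing the slots modulo the belt height means no row is ever copied inside the local store; only the mapping from row number to slot advances.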

Additionally, 2 rows are needed to store the constant values and to temporarily store the primitive variables (u, v, p) computed from the conservative variables (ρ, ρu, ρv, E). The implementation reuses the environment of the CNN simulation kernel. Template operations are optimized according to the discretized equations (2.7) to improve performance. The optimized kernel requires about 32 KB of the local store of the SPE, leaving approximately 224 KB for the row buffers. Therefore, a buffered row can be at most 1000 grid points long, while the number of rows is limited only by the size of the main memory. Storing the precomputed constant values requires 27% more local store per grid point, but this overhead is affordable due to the doubled performance.
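The conversion from conservative to primitive variables mentioned above can be sketched as follows. The sketch assumes an ideal gas with a ratio of specific heats of 1.4 and takes E to be the total energy per unit volume; neither assumption is stated in this section, and the type and function names are hypothetical.

```c
#include <assert.h>

#define GAMMA 1.4f  /* assumed ratio of specific heats (ideal gas) */

/* Hypothetical containers for one grid point. */
typedef struct { float rho, rho_u, rho_v, E; } Conservative;
typedef struct { float u, v, p; } Primitive;

/* Recovers (u, v, p) from (rho, rho*u, rho*v, E), assuming E is the total
 * energy per unit volume: p = (gamma - 1) * (E - 0.5 * rho * (u^2 + v^2)). */
static Primitive to_primitive(Conservative c)
{
    assert(c.rho > 0.0f);
    float inv_rho = 1.0f / c.rho;  /* a single reciprocal, reused twice */
    Primitive q;
    q.u = c.rho_u * inv_rho;
    q.v = c.rho_v * inv_rho;
    /* rho_u * u + rho_v * v equals rho * (u^2 + v^2). */
    q.p = (GAMMA - 1.0f) * (c.E - 0.5f * (c.rho_u * q.u + c.rho_v * q.v));
    return q;
}
```

Computing the reciprocal of the density once and multiplying by it keeps the number of divisions per grid point at one, which matters on a processor without a hardwired divider.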
