20.11.2014 Views

PPKE ITK PhD and MPhil Thesis Classes

PPKE ITK PhD and MPhil Thesis Classes

PPKE ITK PhD and MPhil Thesis Classes

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2.3 Computational Fluid Flow Simulation on Body Fitted Mesh Geometry<br />

with IBM Cell Broadb<strong>and</strong> Engine <strong>and</strong> FPGA Architecture 61<br />

LOAD<br />

COMPUTE<br />

TEMP<br />

RESULT<br />

STORE<br />

Figure 2.13: Local store buffers<br />

To utilize the power of the Cell architecture computation work should be<br />

distributed between the SPEs. In spite of the large memory b<strong>and</strong>width of the<br />

architecture, the memory bus can be easily saturated. Therefore, an appropriate<br />

arrangement of data between SPEs can greatly improve computing performance.<br />

One possible solution is to distribute grid data between the SPEs. In this case<br />

each SPE work on a narrow horizontal slice of the grid, similarly to the first row<br />

of SPEs in Figure 2.14.<br />

Though the above data arrangement is well suited for the architecture of array<br />

processors <strong>and</strong> simplifies the inter-processor communication, the eight SPEs<br />

access the main memory in parallel, which might require very high memory b<strong>and</strong>width.<br />

If few instructions are executed on large data sets, the memory system is<br />

saturated resulting in low performance. One possible solution for this problem<br />

is to form a pipeline using the SPEs to compute several iterations in parallel as<br />

shown in Figure 2.14. In this case continuous data flow <strong>and</strong> synchronization are<br />

required between the neighboring SPEs but this communication pattern is well<br />

suited for the ring structure of the EIB.<br />

The static timing analysis of the optimized CFD solver kernel showed that<br />

a grid point can be updated in approximately 250 clock cycles. Each update<br />

requires moving 64byte data (4x4byte conservative state value, 4x4byte updated<br />

state value, 8x4byte constant value) between the main memory <strong>and</strong> the local<br />

store of the SPE. The Cell processor runs on 3.2GHz clock frequency, therefore,<br />

in an ideal case the expected performance of the computation kernel using one

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!