12.07.2015 Views

Jet Engine Aerodynamics on GPUs - many-core.group - University of ...


NVIDIA® TESLA
JET ENGINE AERODYNAMICS ON GPUs
Dr. Graham Pullan
University of Cambridge


Outline
• CFD (for turbomachinery)
• A good fit for GPUs?
• Implementation
• Results


Turbomachinery
Thousands of blades
Arranged in rows
Each blade row has a bespoke blade profile designed with CFD
[Figure: blade row]


CFD of a jet engine fan
Blades coloured by pressure


Introduction to CFD
Divide the volume into cells
[Figure labels: Blade, Flow]


Governing equations for each cell
Conserve:
• Mass
• Momentum
• Energy


Example: mass conservation
• Evaluate mass fluxes on each face
$F_{\text{mass}} = \frac{A}{4} \sum \rho V_n$


Example: mass conservation
• Sum fluxes on faces to find density change in cell
$\Delta\rho_{\text{cell}} = \sum F_{\text{mass}}$


Example: mass conservation
• Update density
$\Delta\rho_{\text{node}} = \frac{1}{8} \sum \Delta\rho_{\text{cell}}$
(only 4 of 8 surrounding cells shown)


Similarity of steps
Each step uses data from surrounding nodes: a "stencil" operation


Similarity of equations
• For each equation (5 in all):
– Set relevant flux (mass, momentum, energy)
– Sum fluxes
– Update nodes
– (plus smoothing, which is also a stencil, and boundary conditions, which are not)


CPU run times (x86 machines)

Steady approximation: one blade per row
Case                        Grid size    Run time
1 blade                     0.5 Mcells   1 CPU hour
1 stage (2 blades)          1.0 Mcells   3 CPU hours
1 component (5 stages)      5.0 Mcells   20 CPU hours

Unsteady approximation: all blades in row
Case                        Grid size    Run time
1 component (1000 blades)   500 Mcells   0.1 M CPU hours
Engine (4000 blades)        2 Gcells     1 M CPU hours


Peak FLOPs


The purpose of GPUs


Graphics and scientific computing
GPUs are designed to apply the same shading function to many pixels simultaneously


Graphics and scientific computing
GPUs are designed to apply the same function to many data simultaneously


Are GPUs a good fit for CFD?
• Our CFD code is:
– SIMD (same functions applied to all cells in domain)
– Single precision
– Large data sets (c. 10M nodes) fit on one 4GB Tesla card
• (bandwidth on card is high, c. 100 GB/s; much slower to/from card, c. 8 GB/s; and steps in CFD are "memory bound")


Pre-CUDA: GPU coercion
Application specifies geometry; GPU rasterizes
Each fragment is shaded (SIMD)
Shading can use values from memory (textures)
Image can be stored for re-use
Courtesy, John Owens, UC Davis


Pre-CUDA: GPU coercion
Draw a quad
Run a SIMD program over each fragment
Gather is permitted from texture memory
Resulting buffer can be stored for re-use
Courtesy, John Owens, UC Davis


CUDA
• Graphics abstraction is removed
• Scalar variables (not graphics-type 4-vectors!)
• Extensions to C (not graphics APIs, e.g. OpenGL)
• BUT porting 15,000 lines of existing FORTRAN CFD code to CUDA is still a lengthy task


Overall strategy
• Divide up domain
– each sub-domain to a thread block
– update nodes in sub-domain with most efficient stencil operation we can come up with!
– update sub-domain boundaries (MPI if needed)


Efficient stencil operations
• Launch one thread per element in an i-k plane
• Load enough planes into shared memory as needed by stencil
• Update elements in plane (store in global device memory)
• Load new (i-k) plane, then repeat, iterating in the j direction


CUDA example

__global__ void smooth_kernel(float sf, float *a_data, float *b_data)
{
    /* shared memory array */
    __shared__ float a[16][3][5];

    /* fetch first planes */
    a[i][0][k] = a_data[i0m10];
    a[i][1][k] = a_data[i000];
    a[i][2][k] = a_data[i0p10];
    __syncthreads();

    /* compute */
    b_data[i000] = sf1*a[i][1][k] +
                   sfd6*(a[im1][1][k] + a[ip1][1][k] +
                         a[i][0][k] + a[i][2][k] +
                         a[i][1][km1] + a[i][1][kp1]);

    /* load next "j" plane and repeat ... */
}


SBLOCK: stencil framework
• SBLOCK framework for stencil operations on structured grids:
– Source-to-source compiler
• Takes in high level kernel definitions
• Produces optimised kernels in C or CUDA
• Allows new stencils to be implemented quickly
• Allows new stencil optimisation strategies to be deployed on all stencils (without typos!)


Example SBLOCK definition

kind = "stencil"
bpin = ["a"]
bpout = ["b"]
lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, -1, 0),
          (0, 1, 0), (0, 0, -1), (0, 0, 1))
calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}


C implementation

void smooth(float sf, float *a, float *b)
{
    for (k = 0; k < nk; k++) {
        for (j = 0; j < nj; j++) {
            for (i = 0; i < ni; i++) {
                /* compute indices i000, im100, etc */
                b[i000] = sf1*a[i000] +
                          sfd6*(a[im100] + a[ip100] +
                                a[i0m10] + a[i0p10] +
                                a[i00m1] + a[i00p1]);
            }
        }
    }
}


CUDA implementation

__global__ void smooth_kernel(float sf, float *a_data, float *b_data)
{
    /* shared memory array */
    __shared__ float a[16][3][5];

    /* fetch first planes */
    a[i][0][k] = a_data[i0m10];
    a[i][1][k] = a_data[i000];
    a[i][2][k] = a_data[i0p10];
    __syncthreads();

    /* compute */
    b_data[i000] = sf1*a[i][1][k] +
                   sfd6*(a[im1][1][k] + a[ip1][1][k] +
                         a[i][0][k] + a[i][2][k] +
                         a[i][1][km1] + a[i][1][kp1]);

    /* load next "j" plane and repeat ... */
}


Turbostream
• CUDA port of existing FORTRAN code (TBLOCK)
• 15,000 lines FORTRAN
• 5,000 lines kernel definitions -> 30,000 lines of CUDA
• Runs on CPU or multiple GPUs
• 20x speedup on Tesla C1060 as compared to all cores of a modern Intel Core 2 Quad


Turbostream
• 9 minutes on a Tesla S870 (4 GPUs)
• 12 hours on one 2.5 GHz CPU core


FORTRAN & CUDA comparison
[Figure panels: Fortran, CUDA]


Impact of GPU accelerated CFD
• Tesla Personal Supercomputer enables
– Full turbine in 10 minutes (not 12 hours)
– One blade (for design) in 2 minutes
• Tesla cluster enables
– Interactive design of blades for first time
– Use of higher accuracy methods at early stage in design process


Summary
• Many science applications fit the SIMD model used in GPUs
• CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs
• Existing codes have to be analysed and re-coded to best fit the many-core architecture
• The speed ups are such that this can be worth doing
• For our application, the step-change in capability is revolutionary


More information
www.many-core.group.cam.ac.uk
