Jet Engine Aerodynamics on GPUs - many-core.group - University of ...

NVIDIA® TESLA
Jet Engine Aerodynamics on GPUs
Dr. Graham Pullan
University of Cambridge

Outline
• CFD (for turbomachinery)
• A good fit for GPUs?
• Implementation
• Results

Turbomachinery
• Thousands of blades
• Arranged in rows
• Each blade row has a bespoke blade profile designed with CFD
[Figure label: Blade row]

CFD of a jet engine fan
Blades coloured by pressure

Introduction to CFD
Divide the volume into cells
[Figure labels: Blade, Flow]

Governing equations for each cell
Conserve:
• Mass
• Momentum
• Energy

Example: mass conservation
• Evaluate mass fluxes on each face: $F_{\text{mass}} = \frac{A}{4}\sum \rho V_n$
• Sum fluxes on faces to find density change in cell: $\Delta\rho_{\text{cell}} = \sum F_{\text{mass}}$
• Update density: $\Delta\rho_{\text{node}} = \frac{1}{8}\sum \Delta\rho_{\text{cell}}$ (only 4 of 8 surrounding cells shown)

Similarity of steps
Each step uses data from surrounding nodes: a "stencil" operation

Similarity of equations
• For each equation (5 in all):
– Set relevant flux (mass, momentum, energy)
– Sum fluxes
– Update nodes
– (plus smoothing, which is also a stencil, and boundary conditions, which are not)

CPU run times (x86 machines)

Steady approximation, one blade per row:
  1 blade                      0.5 Mcells    1 CPU hour
  1 stage (2 blades)           1.0 Mcells    3 CPU hours
  1 component (5 stages)       5.0 Mcells    20 CPU hours

Unsteady approximation, all blades in row:
  1 component (1000 blades)    500 Mcells    0.1M CPU hours
  Engine (4000 blades)         2 Gcells      1M CPU hours

Peak FLOPs

The purpose of GPUs

Graphics and scientific computing
GPUs are designed to apply the same shading function to many pixels simultaneously

Graphics and scientific computing
GPUs are designed to apply the same function to many data simultaneously

Are GPUs a good fit for CFD?
• Our CFD code is:
– SIMD (same functions applied to all cells in domain)
– Single precision
– Large datasets (c. 10M nodes) fit on one 4 GB Tesla card
• (Bandwidth on the card is high, c. 100 GB/s, but much slower to and from the card, c. 8 GB/s, and the steps in CFD are "memory bound")

Pre-CUDA: GPU coercion
• Application specifies geometry; GPU rasterizes
• Each fragment is shaded (SIMD)
• Shading can use values from memory (textures)
• Image can be stored for re-use
Courtesy, John Owens, UC Davis

Pre-CUDA: GPU coercion
• Draw a quad
• Run a SIMD program over each fragment
• Gather is permitted from texture memory
• Resulting buffer can be stored for re-use
Courtesy, John Owens, UC Davis

CUDA
• Graphics abstraction is removed
• Scalar variables (not graphics-type 4-vectors!)
• Extensions to C (not graphics APIs, e.g. OpenGL)
• BUT: porting 15,000 lines of existing FORTRAN CFD code to CUDA is still a lengthy task

Overall strategy
• Divide up domain
– each sub-domain to a thread block
– update nodes in sub-domain with the most efficient stencil operation we can come up with!
– update sub-domain boundaries (MPI if needed)

Efficient stencil operations
• Launch one thread per element in an i-k plane
• Load as many planes into shared memory as the stencil needs
• Update elements in plane (store in global device memory)
• Load new (i-k) plane and repeat, iterating in the j direction

CUDA example

    __global__ void smooth_kernel(float sf, float *a_data, float *b_data)
    {
        /* shared memory array */
        __shared__ float a[16][3][5];

        /* The thread indices (i, k, im1, ip1, km1, kp1), the global offsets
           (i000, i0m10, i0p10) and the weights sf1 and sfd6 are set up in
           code not shown on the slide. */

        /* fetch first planes */
        a[i][0][k] = a_data[i0m10];
        a[i][1][k] = a_data[i000];
        a[i][2][k] = a_data[i0p10];
        __syncthreads();

        /* compute */
        b_data[i000] = sf1*a[i][1][k] + sfd6*(a[im1][1][k] + a[ip1][1][k] +
                                              a[i][0][k]   + a[i][2][k]   +
                                              a[i][1][km1] + a[i][1][kp1]);

        /* load next "j" plane and repeat ... */
    }

SBLOCK: stencil framework
• SBLOCK framework for stencil operations on structured grids:
– Source-to-source compiler
• Takes in high-level kernel definitions
• Produces optimised kernels in C or CUDA
• Allows new stencils to be implemented quickly
• Allows new stencil optimisation strategies to be deployed on all stencils (without typos!)

Example SBLOCK definition

    kind = "stencil"
    bpin = ["a"]
    bpout = ["b"]
    lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, -1, 0),
              (0, 1, 0), (0, 0, -1), (0, 0, 1))
    calc = {"lvalue": "b",
            "rvalue": """sf1*a[0][0][0] +
                         sfd6*(a[-1][0][0] + a[1][0][0] +
                               a[0][-1][0] + a[0][1][0] +
                               a[0][0][-1] + a[0][0][1])"""}

C implementation

    void smooth(float sf, float *a, float *b)
    {
        for (k = 0; k < nk; k++) {
            for (j = 0; j < nj; j++) {
                for (i = 0; i < ni; i++) {
                    /* compute indices i000, im100, etc */
                    b[i000] = sf1*a[i000] +
                              sfd6*(a[im100] + a[ip100] +
                                    a[i0m10] + a[i0p10] +
                                    a[i00m1] + a[i00p1]);
                }
            }
        }
    }

CUDA implementation

    __global__ void smooth_kernel(float sf, float *a_data, float *b_data)
    {
        /* shared memory array */
        __shared__ float a[16][3][5];

        /* The thread indices (i, k, im1, ip1, km1, kp1), the global offsets
           (i000, i0m10, i0p10) and the weights sf1 and sfd6 are set up in
           code not shown on the slide. */

        /* fetch first planes */
        a[i][0][k] = a_data[i0m10];
        a[i][1][k] = a_data[i000];
        a[i][2][k] = a_data[i0p10];
        __syncthreads();

        /* compute */
        b_data[i000] = sf1*a[i][1][k] + sfd6*(a[im1][1][k] + a[ip1][1][k] +
                                              a[i][0][k]   + a[i][2][k]   +
                                              a[i][1][km1] + a[i][1][kp1]);

        /* load next "j" plane and repeat ... */
    }

Turbostream
• CUDA port of existing FORTRAN code (TBLOCK)
• 15,000 lines FORTRAN
• 5,000 lines kernel definitions -> 30,000 lines of CUDA
• Runs on CPU or multiple GPUs
• 20x speedup on Tesla C1060 as compared to all cores of a modern Intel Core 2 Quad

Turbostream
• 9 minutes on a Tesla S870 (4 GPUs)
• 12 hours on one 2.5 GHz CPU core
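As a quick check on what these two timings imply, the ratio can be worked out directly:

```latex
\[
\frac{12\ \text{h} \times 60\ \text{min/h}}{9\ \text{min}}
  = \frac{720\ \text{min}}{9\ \text{min}}
  = 80
\]
```

The four-GPU run is therefore about 80 times faster than the single-core run. This ratio compares four GPUs against one CPU core, so it is not directly comparable with the 20x figure quoted for one Tesla C1060 against all cores of a quad-core CPU.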

FORTRAN & CUDA comparison
[Figure panels: Fortran, CUDA]

Impact of GPU accelerated CFD
• Tesla Personal Supercomputer enables
– Full turbine in 10 minutes (not 12 hours)
– One blade (for design) in 2 minutes
• Tesla cluster enables
– Interactive design of blades for the first time
– Use of higher accuracy methods at an early stage in the design process

Summary
• Many science applications fit the SIMD model used in GPUs
• CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs
• Existing codes have to be analysed and re-coded to best fit the many-core architecture
• The speedups are such that this can be worth doing
• For our application, the step-change in capability is revolutionary

More information
www.many-core.group.cam.ac.uk
