Tsubame2.0 - GPU Technology Conference


TSUBAME2.0 to 2.5, Towards 3.0 and Exascale
Satoshi Matsuoka, Tokyo Institute of Technology


2006: TSUBAME1.0 as No.1 in Japan. Total 85 TeraFlops, #7 on the Top500, June 2006: more than all university centers combined (45 TeraFlops) and more than the Earth Simulator (40 TeraFlops, #1 from 2002 to 2004).

Modern IDC server speedups: x2 in 2 years. "Performance per server and performance per thousand dollars of purchase cost double every two years or so, which tracks the typical doubling time for transistors on a chip predicted by the most recent incarnation of Moore's law." That is only x32 in 10 years, c.f. HPC's x1000 in 10 years: a x30 discrepancy. Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009.

Performance Comparison of CPU vs. GPU. (Figure: bar charts of peak performance [GFLOPS] and memory bandwidth [GB/s] for CPU and GPU sockets.) A x5-6 socket-to-socket advantage for the GPU in both compute and memory bandwidth, at the same power (200W GPU vs. 200W CPU+memory+NW+…).
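The comparison above can be summarized with a simple roofline-style estimate: a kernel's attainable rate is capped either by peak compute or by memory bandwidth times its arithmetic intensity, so a x5-6 advantage in both caps carries over to most kernels. A minimal sketch, with illustrative numbers (not the slide's exact figures):

```python
# Roofline-style bound: attainable GFLOPS is the lesser of peak compute
# and memory bandwidth times arithmetic intensity (flops per byte moved).
def attainable_gflops(peak_gflops, bw_gbs, flops_per_byte):
    return min(peak_gflops, bw_gbs * flops_per_byte)

# Illustrative (assumed) sockets: a GPU with ~5x the CPU's peak and BW.
cpu = attainable_gflops(peak_gflops=160.0, bw_gbs=40.0, flops_per_byte=1.0)
gpu = attainable_gflops(peak_gflops=1000.0, bw_gbs=200.0, flops_per_byte=1.0)
```

For a bandwidth-bound kernel (low flops per byte) both sockets sit on the bandwidth roof, so the GPU's advantage is exactly the bandwidth ratio.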

TSUBAME2.0, Nov. 1, 2010: "The Greenest Production Supercomputer in the World". A new development on 32nm/40nm processes. (Figure: the machine hierarchy, annotated with memory bandwidths of >400GB/s, >1.6TB/s, >12TB/s, and >600TB/s, 80Gbps network bandwidth, 220Tbps network bisection bandwidth, and maximum power figures of ~1KW, 35KW, and 1.4MW at successive scales.)

2010: TSUBAME2.0 as No.1 in Japan. Total 2.4 Petaflops, #4 on the Top500, Nov. 2010: more than all other Japanese centers on the Top500 combined (2.3 PetaFlops).

TSUBAME Wins Awards… "Greenest Production Supercomputer in the World", the Green 500, Nov. 2010 and June 2011 (#4 on the Top500, Nov. 2010).

TSUBAME Wins Awards… ACM Gordon Bell Prize 2011, Special Achievements in Scalability and Time-to-Solution: "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer".


Precise blood-flow simulation of an artery on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy). Personal CT scan + simulation => accurate diagnostics of cardiac illness. 5 billion red blood cells + 10 billion degrees of freedom.

MUPHY: Multiphysics simulation of blood flow (Melchionna, Bernaschi et al.)
• Combined Lattice-Boltzmann (LB) simulation for plasma, coupled with Molecular Dynamics (MD) for red blood cells
• Realistic geometry (from CAT scan); fluid: blood plasma via Lattice Boltzmann; body: red blood cells (RBCs) represented as ellipsoidal particles via extended MD
• The irregular mesh is partitioned with the PT-SCOTCH tool, considering the cutoff distance
• Two levels of parallelism: CUDA (on GPU) + MPI
• 1 billion mesh nodes for the LB component, 100 million RBCs
• 4000 GPUs, 0.6 Petaflops
• ACM Gordon Bell Prize 2011 Honorable Mention

Industry program: TOTO Inc., comparing TSUBAME (150 GPUs) with their in-house cluster.

Drug discovery with Astellas Pharma of specific remedies for tropical diseases such as dengue fever. Accelerates in-silico screening and data mining.

100-million-atom MD SimulationM. Sekijima (Tokyo Tech), Jim Phillips (UIUC)

Mixed-Precision Amber on TSUBAME2.0 for industrial drug discovery: x10 faster and 75% more energy efficient. Test case: a nucleosome (25,095 particles). With development costs of $500M~$1B per drug, even a 5-10% improvement of the process will more than pay for TSUBAME.
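As a toy illustration of the mixed-precision idea (this is not Amber's actual algorithm): single-precision arithmetic is much faster on GPUs, but naively accumulating many small contributions in float32 drifts, which is why the sensitive accumulations are kept in double precision:

```python
import struct

def to_f32(x):
    # Round a Python float (binary64) to the nearest binary32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def accumulate(n, dx, single):
    # Sum n copies of dx, rounding to float32 after every add when
    # `single` is set, mimicking a single-precision accumulator.
    s = 0.0
    for _ in range(n):
        s = to_f32(s + dx) if single else s + dx
    return s

# Exact sum is 100_000 * 1e-3 = 100.0
sp = accumulate(100_000, 1e-3, single=True)
dp = accumulate(100_000, 1e-3, single=False)
```

The double-precision sum stays within a hair of 100.0, while the single-precision accumulator drifts by orders of magnitude more; a mixed-precision code spends most flops in SP and reserves DP for such reductions.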

On the necessity of upgrading TSUBAME
• TSUBAME2.0 during the busy season (Feb. 2, 2012): 99% node utilization. Demand in busy periods already exceeds supply, causing large opportunity losses; in particular, disaster-prevention simulations and next-generation industrial applications, whose jobs use many nodes and run for a long time, are nearly impossible to execute then.
• Effect of the upgrade: moving to TSUBAME2.5 (17 Petaflops single precision) eliminates the busy-season opportunity loss and enables large-scale disaster-prevention and next-generation industrial runs, contributing to a safe and secure nation.
• Job sizes during the January 2012 busy season, by number of nodes (x): x = 1: 476,162 jobs (96.84%); 1 < x

But it is not simple…
• How do we get the funds?
• Which many-core accelerator?
• Will it work in the first place?
• Is strengthening single precision useful?
• Won't other parts, e.g. the network, become a bottleneck?
• Will it be efficient?
• How is power; will it stay within the facility limit?

Explaining the upgrade to TSUBAME2.5 (to various stakeholders)

TSUBAME2.0's achievements:
• 4th fastest supercomputer in the world (Nov. 2010 Top500)
• ASUCA weather code: 145 TeraFlops, a world record
• Dendrite crystallization: 2.0 PetaFlops, 2011 Gordon Bell Award
• Blood flow: 600 TeraFlops, 2011 Gordon Bell Award Honorable Mention
• Over 10 petascale applications

Reasons for the upgrade:
1. Steady scientific and technical results, the fruit of years of collaborative research (Gordon Bell awards; Info-Plosion, JST CREST, Ultra Low Power HPC), have made TSUBAME one of the university's symbols.
2. TSUBAME2.0's capacity filled up early: ~2000 supercomputer users, over 90% utilization, and 50-60% GPU utilization within jobs. (Figure: monthly total node-hours from Nov. 2010 through Feb. 2012 rising toward 900,000; the machine is nearly saturated. Usage totals by user category as of Dec. 20, 2011 and end of Feb. 2012.)
3. The processor assumed for TSUBAME3.0 slipped, with operation now possible only from mid-2015 or later.
4. Rapid growth of other centers' supercomputers: without an upgrade, TSUBAME falls to 3rd-4th domestically and below 20th worldwide.

Processor roadmap underlying the plan (SFP/DFP = single-/double-precision floating-point peak):
• NVIDIA GPU: G200 (55nm, 2008Q2) -> Fermi (40nm, 2010Q2) -> Kepler2 (28nm, 2012Q3) -> Maxwell
• Intel Xeon: Harpertown (45nm, 2007Q4) -> Nehalem (45nm, 2009Q3) -> Westmere (32nm, 2010Q2) -> Sandy Bridge EP (32nm, 2012Q1) -> Ivy Bridge EP (22nm, 2013Q2) -> Haswell EP -> Broadwell EP
• Intel Xeon Phi: Knights Ferry (45nm, 2010Q2) -> Xeon Phi (22nm, 2012Q4) -> Xeon Phi (KNL)
• TSUBAME: 1.2 (2008Q4, #1 SFP/DFP in Japan) -> 2.0 (2010Q4, 4.8/2.4 PF, #1 SFP / #2 DFP in Japan) -> 2.5 (2013Q2-Q3, 17 PF / 5.7 PF) -> 3.0 (2015Q3, >75/25 PF); c.f. the K computer (2012Q2, 11/11 PF). Slide annotations: x66,000 faster, x3 more power efficient.

How do we upgrade? Which many-core processor?

Processor specifications:

Processor          Xeon E5-2670  Xeon Phi 5110P  Tesla M2050  Tesla K20c   Tesla K20X   GeForce GTX Titan
Performance (SP)   333 GFlops    2022 GFlops     1030 GFlops  3520 GFlops  3950 GFlops  4500 GFlops
Memory bandwidth   51.2 GB/s     320 GB/s        148 GB/s     208 GB/s     250 GB/s     288 GB/s
                                 (ECC off)       (ECC off)    (ECC off)    (ECC off)    (non-ECC)
Memory size        ---           8 GB            3 GB         5 GB         6 GB         6 GB
Number of cores    8             60/61           448          2496         2688         2688
Clock speed        2.6 GHz       1.053 GHz       1.15 GHz     0.706 GHz    0.732 GHz    0.837 GHz

Intel Xeon Phi Coprocessor High Performance Programming
Authors: James Jeffers, James Reinders (Intel)

Discretization of the 3-D diffusion equation

  ∂f/∂t = κ (∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z²)

into an explicit 7-point stencil:

  f_{i,j,k}^{n+1} = c_c f_{i,j,k}^n + c_w f_{i-1,j,k}^n + c_e f_{i+1,j,k}^n + c_s f_{i,j-1,k}^n + c_n f_{i,j+1,k}^n + c_b f_{i,j,k-1}^n + c_t f_{i,j,k+1}^n

with c_w = c_e = κΔt/Δx², c_s = c_n = κΔt/Δy², c_b = c_t = κΔt/Δz², and c_c = 1 − 2κΔt(1/Δx² + 1/Δy² + 1/Δz²).
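The stencil above can be sketched directly (a minimal, unoptimized host-side version; the tuned SIMD and GPU kernels on the following slides implement the same update):

```python
def diffusion_step(f, kappa, dt, dx, dy, dz):
    # One explicit step of the 7-point diffusion stencil; interior points
    # only, so boundary values act as fixed (Dirichlet) conditions.
    nx, ny, nz = len(f), len(f[0]), len(f[0][0])
    ce = cw = kappa * dt / dx ** 2
    cn = cs = kappa * dt / dy ** 2
    ct = cb = kappa * dt / dz ** 2
    cc = 1.0 - 2.0 * (ce + cn + ct)
    # Copy so that all reads see time level n while writing level n+1.
    g = [[[f[i][j][k] for k in range(nz)] for j in range(ny)] for i in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                g[i][j][k] = (cc * f[i][j][k]
                              + ce * f[i + 1][j][k] + cw * f[i - 1][j][k]
                              + cn * f[i][j + 1][k] + cs * f[i][j - 1][k]
                              + ct * f[i][j][k + 1] + cb * f[i][j][k - 1])
    return g
```

One step on a unit spike conserves the total (the seven coefficients sum to 1), which is a quick sanity check for any optimized reimplementation.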

Performance [GFLOPS] of the diffusion stencil across processors. (Figure: bars for Xeon E5-2670 @ 2.60GHz, Xeon Phi 5110P, Tesla M2050, Tesla K20c, Tesla K20X, and GeForce GTX Titan, each with successive optimizations: standard; standard + SIMD; peeling boundary; + tiling; register reuse in the z-direction; + shared memory in the x-direction. Measured values range from 14.5 GFLOPS for the baseline CPU code up to 276 GFLOPS on the best GPU.)

CPU, MIC, and GPU performances. Kepler GPU tuning:
• Shared memory use in the x-directional kernel
• Special function unit
• Loop unrolling
• Variable reuse in the y- and z-loops
• Reduction of branch divergence
(Figure: performance [GFlops]: Xeon E5-2600 53.6, Xeon Phi 3110P 59.5, Xeon Phi 5110P 57.1, Tesla M2050 457, Tesla K20C 866, Tesla K20X 958, GeForce GTX Titan 1218.)

Is strengthening single precision useful? All the applications shown so far were in fact (predominantly) single precision. And won't other parts, e.g. the network, become a bottleneck?

SL390 vs. SL250 platform differences.
• SL390s G7 (Westmere generation): each CPU connects over QPI to an IOH, and the three GPUs hang off the IOHs' PCIe lanes.
• SL250s Gen8 (Sandy Bridge generation): the three GPUs connect to PCIe lanes directly from the CPUs. (The diagram flags a bandwidth limitation.)
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

NVIDIA Linpack on 1 node with 3×K20X (by HP): 3050 GFLOPS vs. 1380 GFLOPS single-node Linpack. GPU-CPU transfer bandwidth varies from 5280 to 2010 MB/s in our measurements on the SL390 G7. Will it affect real applications?

3500 fiber cables totaling >100 km, with DFB silicon photonics. End-to-end 7.5 GB/s, >2 µs. Non-blocking 200 Tbps bisection bandwidth.
NEC Confidential

TSUBAME2.0 vs. Earth Simulator 1 node.
• ES1 node: 8 APs at 8 GFlops each, 64 GFlops total; 16 GB main memory at 250 GB/s; RCU/MNU network unit to the crossbar at 10.8 GB/s (up/down). Memory:network ratio ~23:1.
• TSUBAME2.0 node: 3× NVIDIA Fermi M2050 at 515 GFlops each over PCIe (6.5 GB/s × 3); 9 GB GDDR5 (3 GB × 3) at 270 GB/s total; QDR IB × 2 HCAs through the IOH, ~7.5 GB/s (up/down); a 6-core x86 CPU at 70 GFlops for the OS, services, and sparse access, with 16-32 GB memory at 30-43 GB/s; 120 GB SSD at 300 MB/s. Memory:network ratio ~37:1.

TSUBAME2.5 node in the same comparison with the Earth Simulator 1 node (64 GFlops, 250 GB/s memory, 10.8 GB/s to the crossbar; Mem:NW ~23:1).
• TSUBAME2.5 node: 3× NVIDIA Kepler K20X at 1310 GFlops each over PCIe (6.5 GB/s × 3); 18 GB GDDR5 (6 GB × 3) at 570 GB/s total; QDR IB × 2, ~7.5 GB/s (up/down); a 6-core x86 CPU at 70 GFlops (OS, services, sparse access) with 54 GB memory at 30-43 GB/s; 120 GB SSD at 300 MB/s. Memory:network ratio rises to ~76:1.

The need for high-resolution weather computation
• "Guerrilla" downpours: sudden, localized heavy rain (~km to ~10 km); c.f. concentrated downpours along e.g. the Baiu front (~100 km). Computing on grids finer than the operational 5 km is essential.
• ASUCA production code: the next-generation high-resolution weather simulation code under development by the Japan Meteorological Agency. Compressible nonhydrostatic equations; flux form; generalized coordinates.
• Run on the full TSUBAME2.0 system with 4000 GPUs, a world record for weather computation [Shimokawabe, Aoki, et al., SC2010].
http://www.nikkeibp.co.jp/sj/2/column/z/33/
http://trendy.nikkeibp.co.jp/lc/photorepo/080916_photo/

Current weather forecast: 5 km resolution (inaccurate cloud simulation). ASUCA typhoon simulation: 500 m resolution (x1000), a 4792×4696×48 grid on 437 GPUs.

Multi-GPU: boundary exchange
• GPUs and CPUs cooperate with MPI: (1) GPU → CPU, (2) CPU → CPU, (3) CPU → GPU.
GPUs cannot directly access data stored in the global memory of other GPUs, so each boundary region is staged through the hosts.
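The three-step exchange can be sketched as follows (a host-only Python simulation with a hypothetical helper name; in the real code, steps (1) and (3) are cudaMemcpy calls and step (2) is an MPI send/recv pair):

```python
def exchange_boundaries(left, right):
    # Each subdomain is a list of rows; column 0 and column -1 are ghost
    # cells. Simulates the three staging steps entirely on the host.
    # (1) "GPU -> CPU": copy each subdomain's boundary column out
    send_from_left = [row[-2] for row in left]    # last interior column
    send_from_right = [row[1] for row in right]   # first interior column
    # (2) "CPU -> CPU": in the real code, an MPI send/recv pair
    recv_on_right, recv_on_left = send_from_left, send_from_right
    # (3) "CPU -> GPU": write received data into the ghost columns
    for row, v in zip(left, recv_on_left):
        row[-1] = v
    for row, v in zip(right, recv_on_right):
        row[0] = v

left = [[1.0, 2.0, 3.0, 0.0]]    # one row; trailing 0.0 is the ghost cell
right = [[0.0, 4.0, 5.0, 6.0]]   # leading 0.0 is the ghost cell
exchange_boundaries(left, right)
```

After the call, each subdomain's ghost column holds its neighbor's boundary values, so the next stencil sweep can read them locally.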

Optimization for multi-GPU computing
• Overlapping computation and communication: for each variable, the computation is followed by a boundary exchange (communication). How can the two be overlapped?
• Optimizations: (1) inter-variable overlapping, (2) kernel-division overlapping, (3) fused-kernel overlapping.
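A toy cost model (assumed numbers, not ASUCA measurements) shows the benefit of strategy (1): while variable i computes, variable i-1's boundary exchange is still in flight:

```python
def total_time(compute, comm, overlap):
    # Cost of processing a list of variables, each with a compute time
    # and a boundary-exchange time. With overlap, variable i's compute
    # runs concurrently with variable i-1's exchange.
    if not overlap:
        return sum(c + m for c, m in zip(compute, comm))
    t = 0.0
    pending = 0.0
    for c, m in zip(compute, comm):
        t += max(c, pending)   # computation hides the previous exchange
        pending = m
    return t + pending         # drain the last exchange

compute = [2.0, 2.0, 2.0]      # assumed per-variable compute times
comm = [1.0, 1.0, 1.0]         # assumed per-variable exchange times
serial = total_time(compute, comm, overlap=False)
overlapped = total_time(compute, comm, overlap=True)
```

When compute times dominate, all but the last exchange is hidden entirely, which is the effect the kernel-division and fused-kernel variants try to preserve at finer granularity.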

Breakdown of computational time (SP) in the ASUCA weather code. (Figure: stacked bars of computation vs. communication per variable, e.g. momentum (x), with the total communication time marked. On TSUBAME2.0 (256×208×48 per GPU), overlapping with 3 divided kernels is compared to a non-overlapping single kernel; a corresponding breakdown is shown for TSUBAME1.2 (320×256×48 on 528 GPUs).)

Bandwidth-Reducing Stencil Computation [IEEE AsHES 2013, IEEE Cluster 2013]

Even an oversubscribed full-bisection fat tree (a.k.a. TSUBAME2's network) can be inefficient
• Load imbalance in the links to the core switches becomes a bottleneck
• Imbalance increases with the number of nodes
• Failed or low-performance links worsen the situation substantially
• Especially manifest in collectives

Scalable Multi-GPU 3-D FFT for the TSUBAME 2.0 Supercomputer [SC12]
• Baseline: standard (multi-rail) OpenMPI, one GPU per node
• Our algorithm: IB verbs + dynamic rail selection, with concurrent cudaMemcpy and MPI send/recv, one GPU per node
• 2048³ FFT on 256 nodes / 768 GPUs => 4.8 TFlops (faster than the 8000-node K Computer running Takahashi's volumetric 3-D FFT): effectively 6 times faster, going by a Linpack comparison

Towards TSUBAME 3.0: an interim upgrade of TSUBAME2.0 to 2.5 (early fall 2013, work in summer 2013)
• Upgrade TSUBAME2.0's GPUs (Fermi M2050) to the latest accelerator (e.g., Kepler K20X, Xeon Phi, …); the TSUBAME2.0 compute nodes carry 3 × 1408 = 4224 Fermi GPUs
• SFP/DFP peak from 4.8 PF / 2.4 PF => 17 PF / 5.7 PF or greater (c.f. the K Computer: 11.2/11.2)
• A significant capacity improvement at low cost, without a power increase, and considerable acceleration of important apps
• TSUBAME3.0 follows in 2H2015

2013: TSUBAME2.5 No.1 in Japan in single-precision FP, at 17 Petaflops SFP (5.7 Petaflops DFP): roughly double all university centers combined (~9 Petaflops SFP), and more than the K Computer's 11.4 Petaflops SFP/DFP.

TSUBAME2.0 -> TSUBAME2.5 (thin nodes × 1408):
• Node machine: HP ProLiant SL390s (no change)
• CPU: Intel Xeon X5670 (6-core, 2.93 GHz, Westmere) × 2 (no change)
• GPU: NVIDIA Tesla M2050 × 3 -> NVIDIA Tesla K20X × 3
  - 448 CUDA cores (Fermi) -> 2688 CUDA cores (Kepler)
  - SFP 1.03 TFlops -> 3.95 TFlops; DFP 0.515 TFlops -> 1.31 TFlops
  - 3 GiB -> 6 GiB GDDR5 memory
  - ~90 GB/s -> ~180 GB/s measured STREAM memory bandwidth
• Node performance (incl. CPU turbo boost): SFP 3.40 TFlops -> 12.2 TFlops; DFP 1.70 TFlops -> 4.08 TFlops; ~300 GB/s -> ~570 GB/s measured STREAM bandwidth
• Total system peak: SFP 4.80 PFlops -> 17.1 PFlops (x3.6); DFP 2.40 PFlops -> 5.76 PFlops (x2.4); measured memory bandwidth ~440 TB/s -> ~803 TB/s (x1.8)

TSUBAME2.5: fastest supercomputer in Japan*, Fall 2013
• (*) But NOT in Linpack (^_^): in single-precision peak performance
• Comparison with the K Computer (in SFP): peak FLOPS ~16P vs. ~11P; peak FMOPS (FP memory ops/s) ~0.25P vs. ~0.4P; peak power 1.7 MW vs. 17 MW
• Q: which real scalable apps are SFP / mixed-precision FP / integer? Environment/disaster, medical/life science, industrial

TSUBAME Evolution. (Figure: performance timeline from 6 PF to 25 PF and on to 3.0, with awards, including Graph 500 No. 3 (2011).) In 2013 Q3 or Q4 all the GPUs will be replaced by new accelerators; TSUBAME 2.5 will have 15-17 PFlops in single-precision performance.
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology

Focused research towards TSUBAME3.0 and beyond, towards exascale
• Green computing: ultra power-efficient HPC
• High-radix bisection networks: HW, topology, routing algorithms, placement…
• Fault tolerance: group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Scientific "extreme" big data: ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems: pushing the envelope of low power vs. capacity vs. bandwidth; exploiting the deep hierarchy with new algorithms to decrease bytes/flop
• Post-petascale programming: OpenACC and other many-core programming substrates, task parallelism
• Scalable algorithms for many-core: apps/system/HW co-design

Machine (year)              Power (incl. cooling)  Linpack (PF)  Linpack MFLOPS/W  Factor   Mem BW TB/s (STREAM)  Mem BW MB/s/W
Earth Simulator 1           10 MW                  0.036         3.6               13,400   160                   16
Tsubame1.0 (2006Q1)         1.8 MW                 0.038         21                2,368    13                    7.2
ORNL Jaguar (XT5, 2009Q4)   ~9 MW                  1.76          196               256      432                   48
Tsubame2.0 (2010Q4)         1.8 MW                 1.2           667               75       440                   244
K Computer (2011Q2)         ~16 MW                 10            625               80       3300                  206
BlueGene/Q (2012Q1)         ~12 MW?                17            ~1400             ~35      3000                  250
TSUBAME2.5 (2013Q3)         1.4 MW                 ~3            ~2100             ~24      802                   572
Tsubame3.0 (2015Q4~2016Q1)  1.5 MW                 ~20           ~13,000           ~4       6000                  4000
EXA (2019~20)               20 MW                  1000          50,000            1        100K                  5000

(The slide annotates generation-to-generation jumps of roughly x3, x31.6, x13.7~x34, and x20 between successive machines.)
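The efficiency column is just Linpack performance divided by total power; a quick consistency check against the Tsubame2.0 row:

```python
# Linpack efficiency in MFLOPS/W: performance over power.
# 1 PFLOPS = 1e9 MFLOPS; 1 MW = 1e6 W.
def mflops_per_watt(linpack_pflops, power_mw):
    return linpack_pflops * 1e9 / (power_mw * 1e6)

tsubame20 = mflops_per_watt(1.2, 1.8)       # Tsubame2.0 row: ~667
es1 = mflops_per_watt(0.036, 10.0)          # Earth Simulator row: 3.6
```

The same division reproduces the other rows to within the table's rounding, which also makes the "Factor" column readable as distance from the 50,000 MFLOPS/W exascale target.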

JST-CREST "Ultra Low Power (ULP)-HPC" Project, 2007-2012
Power optimization using novel components in HPC: ultra multi-core (slow & parallel, & ULP), ULP-HPC networks, ULP-HPC SIMD-vector (GPGPU, etc.), and new memory devices (MRAM, PRAM, Flash, etc.), combined with low-power/high-performance modeling, auto-tuning for performance & power, and power-aware, optimizable applications. Target: x10 power efficiency, x1000 overall improvement in 10 years with performance models and algorithms.

Bayesian fusion of model and measurement. With the prior
  y ~ N(θ, σ²),  θ | σ² ~ N(μ₀, σ²/κ₀),  σ² ~ Inv-χ²(ν₀, σ₀²),
the model's estimated execution time supplies the prior mean, measured execution times y₁, …, yₙ update it, and the posterior predictive distribution after n measurements is a Student-t,
  y | (y₁, …, yₙ) ~ t_{ν₀+n}(μₙ, σₙ²(1 + 1/κₙ)),  κₙ = κ₀ + n,  μₙ = (κ₀μ₀ + n·ȳ)/κₙ,
which guides the choice of optimization point.

ABCLibScript: algorithm selection via pre-execution auto-tuning directives, with cost-definition functions of the input variables (CacheS, NB, NPrc) over two target regions (algorithms 1 and 2):
!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
         (target 1: algorithm 1)
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
         (target 2: algorithm 2)
!ABCLib$ select sub region end
!ABCLib$ static select region end

Aggressive power saving in HPC: methodologies compared between enterprise/business clouds and HPC.
• Server consolidation: clouds Good; HPC NG!
• DVFS (dynamic voltage/frequency scaling): clouds Good; HPC Poor
• New devices: clouds Poor (cost & continuity); HPC Good
• New HW & SW architecture: clouds Poor (cost & continuity); HPC Good
• Novel cooling: clouds Limited (cost & continuity); HPC Good (high thermal density)

How do we achieve x1000? Process shrink x100, many-core GPU usage x5, DVFS & other low-power SW x1.5, efficient cooling x1.4: 100 × 5 × 1.5 × 1.4 ≈ x1000!!! (ULP-HPC Project 2007-12; Ultra Green Supercomputing Project 2011-15.)

Statistical power modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically: a linear regression model p = Σᵢ cᵢ·xᵢ, using GPU performance counters xᵢ as explanatory variables and average power consumption as the target
• Prevents overtraining by ridge regression; determines optimal parameters by cross fitting
• Validated against a high-resolution power meter: high accuracy (avg. error 4.7%), accurate even with DVFS
• A linear model shows sufficient accuracy; future work: model-based power optimization, with the possibility of optimizing exascale systems with O(10^8) processors
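A minimal sketch of such a model (synthetic data and assumed counter names, not the paper's actual counters): fit power as a linear combination of two counter rates by ridge regression, solving the two-feature normal equations (XᵀX + λI)c = Xᵀy in closed form:

```python
def ridge_fit_2d(xs, ys, lam=1e-9):
    # Ridge regression for p = c0*x0 + c1*x1: build X^T X + lam*I and
    # X^T y, then invert the 2x2 system directly.
    a00 = sum(x[0] * x[0] for x in xs) + lam
    a01 = sum(x[0] * x[1] for x in xs)
    a11 = sum(x[1] * x[1] for x in xs) + lam
    b0 = sum(x[0] * y for x, y in zip(xs, ys))
    b1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    return ((a11 * b0 - a01 * b1) / det,
            (a00 * b1 - a01 * b0) / det)

# Synthetic "counters" (assumed names: instruction rate, DRAM access
# rate); power samples generated from a known ground-truth model.
xs = [(1.0, 0.2), (2.0, 0.1), (1.5, 0.4), (0.5, 0.5)]
ys = [80.0 * x0 + 100.0 * x1 for x0, x1 in xs]   # true c = (80, 100)
c0, c1 = ridge_fit_2d(xs, ys)
```

With noisy measured power and many correlated counters, the λ term is what tames overtraining; here it is kept tiny so the known coefficients are recovered almost exactly.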

Ultra low power HPC towards TSUBAME3.0. (Figure: Mflops/Watt (single precision) vs. year, 2004-2018, log scale from 100 to 100,000, with trend lines for Moore's law, the Intel report, BG/Q, and the ULP-HPC target; milestones: 2006 TSUBAME1, 2008 TSUBAME1.2 (1st GPU SC), 2010 TSUBAME2.0 (Greenest Production), 2012 3.0 Proto 1, 2, ….)

Year  Machine                                    Peak GFlops  Estim. app GFlops  kW   Estim. Mflops/W
2006  T1.0: Sun X4600                            160          128                1.3  100
2008  T1.2: Sun X4600 + Tesla S1070 (2 GPUs)     1300         630                1.7  370
2010  T2.0: HP SL390s with Tesla M2050 (3 GPUs)  3200         1400               1.0  1400
2012  Proto: Sandy Bridge EP w/ 8 Kepler GPUs    20000        7000               1.1  >6000

Power efficiency of the dendrite application from TSUBAME1.0 through the JST-CREST ULP-HPC prototype, running the Gordon Bell dendrite app. (Figure: efficiency points for GT200, Fermi, K10, and K20, with an extrapolation line from GT200 to K20 projecting a x1680 improvement, against the x1000-in-10-years target line.)

TSUBAME KFC Project (H1 2013 to H1 2016): a ~200 TFlops supercomputer and its cooling facility in a single container. Oil cooling + outside-air cooling + a high-density SC.
• Oil-submersion rack: heat moves from the GPUs (80~90°C) to oil (35~45°C)
• Heat exchanger: heat moves from the oil (35~45°C) to water (25~35°C)
• Evaporative cooling tower: heat moves from the water (25~35°C) to the outside air
• ~200 Kepler-2 GPUs: ~600 TF in SP, ~200 TF in DP, in a standard 20-foot container

TSUBAME-KFCTowards TSUBAME3.0 and BeyondShooting for #1 on Nov. 2013 Green 500!

The current "Big Data" are not really that big
• The typical "real" definition: mining people's privacy data to make money
• Corporate data is gigabytes to terabytes, seldom petabytes
  - Processing involves simple O(n) algorithms, or ones that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity "web" servers linked with 1 Gbps networks running Hadoop/HDFS
• A vicious cycle of stagnation in innovation…
• ⇒ Convergence with supercomputing: extreme big data

Extreme big data in genomics: the impact of new-generation sequencers [slide courtesy of Yutaka Akiyama, Tokyo Tech]. Sequencing data (bp) per dollar grows x4000 per 5 years, c.f. HPC's x33 in 5 years. Source: Lincoln Stein, Genome Biology, vol. 11(5), 2010.

TSUBAME2.0 Storage Overview: 11 PB total (7 PB HDD, 4 PB tape), attached to an Infiniband QDR network for LNET and other services (QDR IB ×4 × 20; QDR IB ×4 × 8; 10 GbE × 2).
• "Global Work Space" #1-#3 (/work9, /work0, /work19) and "Scratch" (/gscr0): Lustre parallel file system volumes, 3.6 PB on SFA10k #1-#5
• Home volumes: 1.2 PB (SFA10k #6, iSCSI), served via GPFS #1-#4 as cNFS/Clustered Samba (HOME) and NFS/CIFS/iSCSI by BlueARC (HOME, system application)
• GPFS with HSM: 2.4 PB HDD + ~4 PB tape
• Node SSDs (thin and fat/medium nodes): 250 TB aggregate, used as scratch
• Grid storage: 130 TB => 500 TB~1 PB

TSUBAME2.0 storage by usage:
• Lustre parallel file system volumes (3.6 PB; "Global Work Space" #1-#3 on /work9, /work0, /work19, plus /gscr0 scratch): concurrent parallel I/O (e.g. MPI-IO) and read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Thin-node SSD scratch (250 TB, 300 GB/s): fine-grained R/W I/O (checkpoints, temporary files, big-data processing), supplemented by fat/medium-node SSDs
• GPFS with HSM (2.4 PB HDD + ~4 PB backup tape): long-term storage
• GPFS #1-#4 home volumes (1.2 PB): home storage for compute nodes and cloud-based campus storage services (cNFS/Clustered Samba with GPFS; NFS/CIFS/iSCSI by BlueARC)
• HPCI storage (130 TB => 500 TB~1 PB): data-transfer service between SCs/CCs

But what does "220 Tbps" mean? Global IP traffic, 2011-2016 (source: Cisco), by type, in PB per month with average bitrate in Tbps:

Type              2011          2012          2013          2014          2015          2016           CAGR 2011-2016
Fixed Internet    23,288/71.9   32,990/101.8  40,587/125.3  50,888/157.1  64,349/198.6  81,347/251.1   28%
Managed IP        6,849/21.1    9,199/28.4    11,846/36.6   13,925/43.0   16,085/49.6   18,131/56.0    21%
Mobile data       597/1.8       1,252/3.9     2,379/7.3     4,215/13.0    6,896/21.3    10,804/33.3    78%
Total IP traffic  30,734/94.9   43,441/134.1  54,812/169.2  69,028/213.0  87,331/269.5  110,282/340.4  29%

The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion people.

Five years ago, data center networks were here (slide from Dan Reed @ MS -> Iowa U)
• Historical hierarchical data center network structure: (mostly) driven by economics, (partially) driven by workloads; performance limited
• Now moving to "flat" (sound familiar?): from N-S to E-W traffic
• Challenges (then and now): configuration and testing; monitoring and resilience; service demand variance (workload redistribution, service drain times); compatibility (see the IPv6 transition); LAN/WAN separation (data islands, geo-resilience and scale); performance and cost, Cost, COST. Did I mention cost?
(Diagram: the Internet feeding L3 border routers (BR) and access routers (AR), an L2 switch layer (S), and load balancers (LB).)

"Convergence" at future extreme scale for computing and data (in clouds?). HPC: x1000 in 10 years, CAGR ~100%; IDC: x30 in 10 years, CAGR ~30-40%, with server unit sales flat (replacement demand). Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009.

Graph500 "Big Data" benchmark
• Input: a Kronecker graph with quadrant probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05
• The benchmark is ranked by so-called TEPS (traversed edges per second), which measures the number of edges traversed per second when searching all the reachable vertices from one arbitrary vertex with each team's optimized BFS (breadth-first search) algorithm
• "Graph 500 Takes Aim at a New Kind of HPC" (November 15, 2010), Richard Murphy (Sandia NL => Micron): "the goal of the Graph 500 benchmark is to measure the performance of a computer solving a large-scale 'informatics' problem… (for) cybersecurity, medical informatics, data enrichment, social networks, and symbolic networks." "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list, and we may find that some exotic architectures dominate the top."
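A minimal sketch of that workflow (pure Python, nowhere near an optimized BFS): generate Kronecker (R-MAT) edges with the quadrant probabilities above, then run a BFS and count traversed edges to obtain a TEPS figure:

```python
import random
import time
from collections import deque

def rmat_edge(scale, rng, probs=(0.57, 0.19, 0.19, 0.05)):
    # One Kronecker (R-MAT) edge in a 2**scale-vertex graph: at each of
    # `scale` levels, pick a quadrant of the 2x2 matrix (A,B / C,D).
    a, b, c, _ = probs
    u = v = 0
    for _ in range(scale):
        r = rng.random()
        row = 0 if r < a + b else 1                           # A,B on top
        col = 0 if (r < a or a + b <= r < a + b + c) else 1   # A,C on left
        u, v = (u << 1) | row, (v << 1) | col
    return u, v

def bfs_teps(n, edges, root):
    # BFS from `root`; returns (TEPS, number of reached vertices).
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {root}
    queue = deque([root])
    traversed = 0
    t0 = time.perf_counter()
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            traversed += 1        # every scanned edge counts toward TEPS
            if v not in seen:
                seen.add(v)
                queue.append(v)
    dt = max(time.perf_counter() - t0, 1e-9)
    return traversed / dt, len(seen)

rng = random.Random(1)
scale = 10                        # 1024 vertices: a toy size
edges = [rmat_edge(scale, rng) for _ in range(16 * (1 << scale))]
teps, reached = bfs_teps(1 << scale, edges, root=edges[0][0])
```

The real benchmark uses SCALE 26+ (tens of billions of edges), 64 randomly chosen roots, and validated BFS trees; the skew of the A quadrant is what produces the scale-free structure that stresses memory and network rather than flops.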

TSUBAME Wins Awards… The Graph 500 "Big Data" benchmark: #3 in Nov. 2011, #4 in June 2012.

Multi-GPU implementation with reduction of data transfer using graph cuts [IEEE CCGrid13]
• Investigates the effect of GPUs on MapReduce-type graph algorithms, compared with existing implementations: an existing CPU implementation and an optimized implementation not using MapReduce
• Handles extremely large-scale graphs: increases available memory by using multiple GPUs
• Reduces the amount of data transfer: as one solution, partitions the graph as a preprocessing step to reduce inter-node data transfer in the Shuffle phase
• Utilizes local storage in addition to memory: loads data in turn from the file system, moves it to the GPUs, and schedules effective data placement

Outperforming a Hadoop-based implementation. PEGASUS, a Hadoop-based GIM-V implementation (Hadoop 0.21.0, with Lustre as the underlying Hadoop file system), is compared against MarsCPU and MarsGPU-3 at SCALE 27 on 128 nodes. (Figure: KEdges/sec on a log scale from 1 to 100,000, higher is better; MarsGPU-3 achieves a 186.8x speedup over PEGASUS.)

TSUBAME3 Extreme Big Data Prototype

Summary
• TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> …: Number 1 in Japan at 17 Petaflops SFP (but not in Linpack); the tradition of being Japan's No. 1 carries on
• TSUBAME3.0 (2015-2016): new supercomputing leadership; renewed world technology leadership as a new-generation supercomputer
• Lots of background R&D for TSUBAME3.0 and towards exascale:
  - Green computing (TSUBAME-KFC)
  - Extreme big data
  - Exascale resilience
  - Many-core programming
  - …
• Please stay tuned, and thank you for your support!
