
Shekhar Borkar, Intel Corp. Exascale Computing—a fact or ... - IPDPS


Outline
• Compute roadmap & technology outlook
• Challenges & solutions for:
  – Compute
  – Memory
  – Interconnect
  – Resiliency
  – Software stack
• Summary


From Giga to Exa, via Tera & Peta
[Figure: log-scale trends from Giga (1986) through Tera and Peta to Exa (2016) for relative processor frequency, concurrency, processor performance, and power.]
• System performance increases faster than processor frequency
• Parallelism continues to increase
• The power & energy challenge continues


The UHPC* Challenge (*DARPA, Ubiquitous HPC Program)
A constant energy target of 20 pJ/operation at every scale:
• 20 MW at Exa
• 20 kW at Peta
• 20 W at Tera
• 2 W at 100 Giga
• 20 mW at Giga
• 20 µW at Mega
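The target is easy to sanity-check: power is just energy per operation times operation rate. A minimal sketch (the scale table and helper loop are mine, not the slide's):

```python
# Power required at a constant 20 pJ/operation, across performance scales.
ENERGY_PER_OP = 20e-12  # joules (20 pJ)

SCALES = {
    "Mega": 1e6, "Giga": 1e9, "100 Giga": 1e11,
    "Tera": 1e12, "Peta": 1e15, "Exa": 1e18,  # operations per second
}

for name, ops_per_sec in SCALES.items():
    watts = ENERGY_PER_OP * ops_per_sec
    print(f"{name:>9}: {watts:.0e} W")
# Exa -> 2e+07 W = 20 MW, the UHPC exascale power target.
```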


Technology Scaling Outlook
[Figure: relative scaling across 45nm, 32nm, 22nm, 14nm, 10nm, 7nm, and 5nm.]
• Transistor density: continues at 1.75–2X per generation
• Frequency: almost flat
• Supply voltage: almost flat
• Energy: some scaling, but well short of ideal


Energy per Compute Operation
[Figure: energy (pJ) across 45nm–7nm for a double-precision FP operation, register-file operands, DRAM access, and communication. Source: Intel.]
• Communication: ~100 pJ/bit
• DRAM: ~75 pJ/bit
• Operands (register file): ~25 pJ/bit
• DP FP operation: ~10 pJ


Voltage Scaling
[Figure: normalized frequency, leakage, total power, and energy efficiency vs. Vdd (0.3–0.9 of nominal), for a design built to voltage scale.]
• When designed to voltage scale, lowering Vdd sharply cuts total power and raises energy efficiency, at the cost of frequency, while leakage's share of power grows


Near-Threshold Voltage (NTV)
[Figure: measurements on 65nm CMOS at 50°C over a 0.2–1.4 V supply range (H. Kaul et al., paper 16.6, ISSCC 2008).]
• Maximum frequency spans ~4 orders of magnitude down into the subthreshold region; active leakage power spans < 3 orders
• Energy efficiency (GOPS/Watt) peaks at 320 mV, a 9.6X improvement over nominal supply
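Why the efficiency optimum sits near threshold rather than at the lowest voltage: dynamic energy per op falls as Vdd², but below NTV the clock slows so much that leakage energy per op explodes. A toy model (all constants are illustrative, not fitted to the 65nm data above):

```python
import numpy as np

VT = 0.32                            # threshold voltage (V), illustrative
vdd = np.linspace(0.36, 1.2, 400)    # supply sweep

freq = (vdd - VT) ** 1.5 / vdd       # alpha-power-law delay model (alpha = 1.5)
e_dyn = vdd ** 2                     # dynamic energy/op ~ C * Vdd^2
p_leak = 0.05 * vdd                  # crude leakage-power proxy
e_leak = p_leak / freq               # leakage energy/op = leakage power * time/op
e_total = e_dyn + e_leak

best = vdd[np.argmin(e_total)]
print(f"Energy/op is minimized near Vdd = {best:.2f} V, just above VT = {VT} V")
```

The same mechanism caps the 9.6X gain: efficiency improves steeply down to NTV, then subthreshold leakage (next slide) takes it back.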


Subthreshold Leakage at NTV
[Figure: subthreshold leakage as a share of total power at 40%, 50%, 75%, and 100% of Vdd, across 45nm–5nm; the spread widens with increasing variations.]
• NTV operation reduces total power and improves energy efficiency
• But subthreshold leakage power is a substantial portion of the total


Mitigating Impact of Variation
1. Variation control with body biasing
   – The body effect is substantially reduced in advanced technologies
   – The energy cost of body biasing could become substantial
   – Fully-depleted transistors have no body left to bias
2. Variation tolerance at the system level
   – Example: in a many-core system, running all cores at full frequency exceeds the energy budget
   – Instead, run each core at its native frequency (f, f/2, f/4)
   – Law of large numbers: per-core variations average out (see the sketch below)
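A minimal Monte Carlo sketch of the law-of-large-numbers argument (core count, frequency bins, and bin probabilities are assumed for illustration):

```python
# Per-core frequency variation largely averages out at the chip level.
import random

random.seed(1)
CORES = 1024
BINS = [1.0, 0.5, 0.25]      # native frequencies f, f/2, f/4 (normalized)
WEIGHTS = [0.6, 0.3, 0.1]    # assumed variation-induced bin probabilities

def chip_throughput():
    return sum(random.choices(BINS, WEIGHTS)[0] for _ in range(CORES))

samples = [chip_throughput() for _ in range(1000)]
mean = sum(samples) / len(samples)
spread = (max(samples) - min(samples)) / mean
print(f"mean {mean:.0f} core-equivalents, chip-to-chip spread {spread:.1%}")
# Individual cores vary by 4X, yet aggregate throughput varies by only a few percent.
```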


Wide Dynamic Range
[Figure: energy efficiency vs. voltage, from zero through subthreshold and NTV to the normal operating range up to max; ~5X gain demonstrated at NTV.]

              Ultra-low Power   Energy Efficient   High Performance
Supply        280 mV            0.45 V             1.2 V
Frequency     3 MHz             60 MHz             915 MHz
Power         2 mW              10 mW              737 mW
Efficiency    1500 MIPS/W       5830 MIPS/W        1240 MIPS/W


Observations
[Figure: power breakdown (memory leakage, memory dynamic, logic leakage, logic dynamic) at subthreshold, NTV, and full Vdd.]
• Leakage power dominates at reduced voltage
• Fine-grain leakage power management is required


Integration of Power Delivery
For efficiency and management: an integrated voltage regulator testchip (5mm converter chip plus load chip, air-core inductors, input/output capacitors, standard OLGA packaging).
[Figure: converter efficiency of ~75–90% vs. load current (0–20 A) at 60, 80, and 100 MHz switching; 2.4V-to-1.5V and 2.4V-to-1.2V conversion; L = 1.9 nH and 0.8 nH.]
Power delivery closer to the load gives:
1. Improved efficiency
2. Fine-grain power management
Schrom et al., "A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm² Using Air-Core Inductors", APEC 2007


Fine-grain Power Management

Mode      Power saving            Wake-up
Normal    All active              –
Standby   Logic off, memory on    50% less power/tile, fast wake-up
Sleep     Logic and memory off    80% less power/tile, slow wake-up

Dynamic, chip level:
• STANDBY: memory retains data, 50% less power per tile
• FULL SLEEP: memories fully off, 80% less power per tile
Dynamic, within a core (21 sleep regions per tile):
• Data memory sleeping: 57% less power
• Instruction memory sleeping: 56% less power
• Router sleeping: 10% less power (stays on to pass traffic)
• Each FP engine sleeping: 90% less power
Energy efficiency increases by 60%.
Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS", JSSC, January 2008
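A small model of the per-tile savings (the per-region shares of tile power are assumptions; the percentage savings are the slide's):

```python
# Relative tile power for a chosen set of sleeping regions.
REGIONS = {
    # name: (assumed share of tile power, fraction saved when sleeping)
    "data_mem":    (0.25, 0.57),
    "instr_mem":   (0.20, 0.56),
    "router":      (0.15, 0.10),   # stays mostly on to pass traffic
    "fp_engine_1": (0.20, 0.90),
    "fp_engine_2": (0.20, 0.90),
}

def tile_power(sleeping):
    return sum(share * (1 - saved if name in sleeping else 1.0)
               for name, (share, saved) in REGIONS.items())

print(f"all awake:          {tile_power(set()):.2f}")
print(f"FP engines asleep:  {tile_power({'fp_engine_1', 'fp_engine_2'}):.2f}")
print(f"whole tile asleep:  {tile_power(set(REGIONS)):.2f}")
```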


Memory & Storage Technologies
[Figure: log-scale cost/bit (pico-$), energy/bit (pJ), and capacity (Gbit) for SRAM, DRAM, NAND/PCM (endurance issues), and disk. Source: Intel.]


Compare Memory Technologies
• DRAM for first-level capacity memory
• NAND/PCM for next-level storage
Source: Intel


3D-Integration of DRAM and Logic
• Logic buffer chip: technology optimized for high-speed signaling, energy-efficient logic circuits, and implementing intelligence
• DRAM stack: technology optimized for memory density and lower cost
• The logic buffer sits beneath the DRAM stack on the package
3D integration provides the best of both worlds.


1Tb/s HMC DRAM Prototype
• 3D integration technology
• 1Gb DRAM array; 512 MB total DRAM per cube
• 128 GB/s bandwidth
• ~8 pJ/bit demonstrated (cf. the Bottom-up Guidance slide)


Communication Energy
[Figure: energy/bit (0.01–100 pJ) vs. interconnect distance (0.1–1000 cm): on die, chip to chip, board to board, and between cabinets.]


On-die Interconnect
[Figure: relative compute energy vs. on-die interconnect energy across 90nm–7nm. Source: Intel.]
• Interconnect energy (per mm) reduces more slowly than compute energy
• On-die data movement energy will start to dominate
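A sketch of the divergence, using the ~6X compute vs. ~1.6X interconnect scaling quoted later in the deck (the starting energies are assumptions):

```python
# Compute energy shrinks ~6X over these generations; wire energy only ~1.6X,
# so moving an operand eventually costs more than computing on it.
NODES = ["90nm", "65nm", "45nm", "32nm", "22nm", "14nm", "10nm", "7nm"]
COMPUTE_SCALE = 0.77   # ~6X total over 7 steps
WIRE_SCALE = 0.935     # ~1.6X total over 7 steps

compute_pj = 50.0      # pJ per FP op at 90nm (assumed)
wire_pj_mm = 0.5       # pJ per bit-mm at 90nm (assumed)

for node in NODES:
    move_pj = 64 * wire_pj_mm   # move one 64-bit operand 1 mm
    print(f"{node:>4}: FP op {compute_pj:5.1f} pJ | 64b x 1mm {move_pj:5.1f} pJ")
    compute_pj *= COMPUTE_SCALE
    wire_pj_mm *= WIRE_SCALE
# With these assumptions the crossover lands around 45nm-32nm; after that,
# data movement dominates.
```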


Network On Chip (NoC)

80-core TFLOPS chip (2006):
• 12.64mm × 21.72mm die; 1.5mm × 2.0mm tiles
• 8 × 10 mesh, 32-bit links; 320 GB/s bisection BW @ 5 GHz
• Tile power: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%

48-core Single-chip Cloud Computer (2009):
• 21.4mm × 26.5mm die; four DDR3 memory controllers
• 2-core clusters in a 6 × 4 mesh (why not 6 × 8?), 128-bit links; 256 GB/s bisection BW @ 2 GHz
• Chip power: cores 70%, MC & DDR3-800 19%, routers & 2D mesh 10%, global clocking 1%


On-chip Interconnects
[Figure: delay (ps) and energy (pJ/bit) vs. length (0–25 mm) at 0.3µ pitch, 0.5V: repeated wire delay vs. router delay, and wire energy vs. router energy.]


Circuit-Switched NoC
• A narrow, high-frequency packet-switched network establishes a circuit
• Data is then transferred over the established circuit: a wide, slower circuit-switched bus
• A differential, low-swing bus improves energy efficiency

Process: 45nm Hi-K/MG CMOS, 9 metal Cu
Nominal supply: 1.1 V
Transistors: 2.85M, supporting 512b data
Die area: 6.25 mm²

2 to 3X higher energy efficiency than a packet-switched network.
Anders et al., "A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8×8 mesh network-on-chip in 45nm CMOS", ISSCC 2010


Hierarchical & Heterogeneous
• Buses to connect over short distances
• A second-level bus joins the clusters: a hierarchy of buses
• Or hierarchical circuit- and packet-switched networks


Electrical Interconnect < 1 Meter
[Figure: energy (pJ/bit) and data rate (Gb/s) across process generations from 1.2µ to 32nm. Source: ISSCC papers.]
• Bandwidth and energy efficiency improve, but not enough


Electrical Interconnect Advances
• Employ new, low-loss, non-traditional interconnects: top-of-package connectors, low-loss flex, low-loss twinax
[Figure: energy (pJ/bit) vs. channel length (0–400 cm) for state-of-the-art HDI vs. flex and twinax.]
• Co-optimize interconnects and circuits for energy efficiency
O'Mahony et al., "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS", ISSCC 2010; J. Jaussi, RESS004, IDF 2010


Straw-man System Interconnect
Assume: 40 Gbps links, 10 pJ/bit, $0.6/Gbps, 8 Bytes/FLOP, naïve tapering.
[Figure: hierarchy of 35 clusters × 35 nodes (~0.8 PF per node), with 1,000 fibers per node and 8,000 fibers per cluster.]
Result: $35M in interconnect and 217 MW of data-movement power.


Bandwidth Tapering
[Figure: Bytes/FLOP at each level of the hierarchy (L1, L2, chip, board, cabinet, system) and the resulting data-movement power per level.]
• Naïve 4X tapering per level: total data-movement power = 217 MW
• Severe tapering (1.13, 0.19, 0.03, 0.0045, 0.0005 Byte/FLOP from L2 down to system): total data-movement power = 3 MW
Intelligent BW tapering is necessary.
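The mechanics of the comparison, as a sketch. The severe column reuses the per-level Bytes/FLOP the slide lists, but the pJ/bit values are my assumptions, so the printed totals show the effect without reproducing the slide's exact 217 MW and 3 MW:

```python
# Data-movement power = sum over levels of (B/F * 8 bits * pJ/bit) * system FLOP/s.
SYSTEM_FLOPS = 1e18  # exascale

PJ_PER_BIT = {"L1": 0.01, "L2": 0.05, "Chip": 1.0, "Board": 10.0, "Cab": 10.0, "Sys": 10.0}
NAIVE  = {"L1": 8.0, "L2": 2.0,  "Chip": 0.5,  "Board": 0.125, "Cab": 0.03125, "Sys": 0.0078}
SEVERE = {"L1": 8.0, "L2": 1.13, "Chip": 0.19, "Board": 0.03,  "Cab": 0.0045,  "Sys": 0.0005}

def dm_power_mw(bytes_per_flop):
    pj_per_flop = sum(bf * 8 * PJ_PER_BIT[lvl] for lvl, bf in bytes_per_flop.items())
    return pj_per_flop * 1e-12 * SYSTEM_FLOPS / 1e6

print(f"naive 4X tapering: {dm_power_mw(NAIVE):5.1f} MW")
print(f"severe tapering:   {dm_power_mw(SEVERE):5.1f} MW")
# Off-chip levels dominate; cutting their Bytes/FLOP is where the savings come from.
```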


Road to Unreliability? From Peta to Exa:
• 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
• Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptible to Vcc droops (noise) and dynamic temperature variations; exacerbates intermittent faults (soft errors)
• Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically
Resiliency will be the corner-stone.


Soft Errors and Reliability
[Figure: neutron SER per cell (sea level, relative to 130nm) vs. voltage for 250nm–65nm (N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006); and latch/memory SER trend from 180nm to 32nm assuming a 2X bit/latch count increase per generation.]
• Soft error rate per bit reduces each generation, and NTV has only a nominal impact on it
• But with bit/latch counts doubling per generation, soft errors at the system level will continue to increase
• NTV helps reliability elsewhere: lower voltage means lower electric fields, and lower power means lower temperature
• Device aging effects will be less of a concern; fewer electromigration-related defects
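The compounding effect in two lines of arithmetic (the 0.8X per-bit SER improvement per generation is an assumption; the 2X bit-count growth is the slide's):

```python
# Per-bit SER falls each generation, bit count doubles: system SER still rises.
ser_per_bit, bits = 1.0, 1.0
for node in ["180nm", "130nm", "90nm", "65nm", "45nm", "32nm"]:
    print(f"{node:>5}: relative system SER = {ser_per_bit * bits:.2f}")
    ser_per_bit *= 0.8   # assumed per-bit improvement per generation
    bits *= 2.0          # 2X bits/latches per generation (slide's assumption)
# Net effect: system-level SER grows ~1.6X per generation despite better cells.
```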


Resiliency

Faults                Examples
Permanent faults      Stuck-at 0 & 1
Gradual faults        Variability, temperature
Intermittent faults   Soft errors, voltage droops
Aging faults          Degradation

Faults cause errors (data & control):
• Datapath errors: detected by parity/ECC
• Silent data corruption: needs HW hooks
• Control errors: control lost (blue screen)

Resiliency must span the stack (applications, system software, programming system, microcode/platform, microarchitecture, circuit & design), providing error detection, fault isolation, fault confinement, reconfiguration, and recovery & adaptation, all with minimal overhead.
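As a concrete instance of the parity/ECC row, a textbook single-error-correcting Hamming(7,4) code; this is a generic illustration, not the hardware scheme the slide assumes:

```python
# Hamming(7,4): 4 data bits + 3 parity bits; corrects any single bit flip.
def encode(d):                       # d: four data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def correct(c):                      # c: received 7-bit codeword (fixed in place)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flip, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return c

cw = encode([1, 0, 1, 1])
cw[4] ^= 1                               # inject a soft error at position 5
assert correct(cw) == encode([1, 0, 1, 1])
print("single-bit soft error detected and corrected")
```

Plain parity, by contrast, only detects; silent data corruption is what escapes both, which is why the slide calls for extra HW hooks.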


Architecture Needs a Paradigm Shift

Architect's past and present priorities:
• Single-thread performance; programming productivity
• Constraints: frequency; legacy and compatibility
• Architecture features chosen for productivity
• Then: (1) cost, (2) reasonable power/energy

Architect's future priorities should be:
• Throughput performance; power/energy
• Constraints: parallelism; application-specific HW
• Architecture features chosen for energy; simplicity
• Then: (1) programming productivity, (2) cost

Must revisit and evaluate each architecture feature, even legacy ones.


HW-SW Co-design
• The stack: applications, system SW, programming system, architecture, circuits & design
• Top down: applications and the SW stack provide guidance for efficient system design
• Bottom up: each layer reports limitations, issues, and opportunities to exploit


Bottom-up Guidance
1. NTV reduces energy but exacerbates variations: small & fast cores, random distribution, temperature dependent
2. Limited NTV for arrays (memory) due to stability issues: voltage scaling is disproportionate, so memory arrays can be made larger
3. On-die interconnect energy (per mm) does not reduce as much as compute: ~6X compute vs. ~1.6X interconnect
4. At NTV, leakage power is a substantial portion of the total: expect ~50% leakage; idle hardware consumes energy
5. DRAM energy scales, but not enough: ~50 pJ/b today, 8 pJ/b demonstrated (3D Hybrid Memory Cube), need < 2 pJ/b
6. System interconnect limited by laser energy and cost (40 Gbps photonic links @ 10 pJ/b): BW tapering and locality awareness are necessary


Today's HW System Architecture
• Processor: 16 cores at ~3 GHz, ~100 W; each core has a small RF, 8-wide FP, 32K I$ and 32K D$, 128K second-level caches, sharing a 16 MB L3
• Sockets with 4 GB DRAM each form the coherent domain
• Coherent domain: 1.5 TF peak, 660 pJ/FLOP (660 MW for Exa), 10 milli-Bytes of DRAM per FLOP/s
• Non-coherent domain: 384 GF/s peak, 260 pJ/FLOP (260 MW for Exa), 55 milli-Bytes of local memory per FLOP/s
Today's programming model comprehends this system architecture.
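The MW-per-Exa figures are a direct unit conversion from pJ/FLOP; a one-liner makes it explicit:

```python
# Power at exascale = energy per FLOP * 1e18 FLOP/s.
def exascale_power_mw(pj_per_flop):
    return pj_per_flop * 1e-12 * 1e18 / 1e6   # pJ -> J, times FLOP/s, W -> MW

for domain, pj in [("coherent", 660), ("non-coherent", 260)]:
    print(f"{domain:>12}: {pj} pJ/FLOP -> {exascale_power_mw(pj):.0f} MW at Exa")
```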


Straw-man Exascale Processor
• Simplest core: RF plus logic, 600K transistors
• First level of hierarchy: clusters of cores (C) with a shared cache
• Next levels: groups of PEs with 1 MB L2 caches, joined by interconnect and next-level caches up to a last-level cache, with service cores alongside

Technology          7nm, 2018
Die area            500 mm²
Cores               2048
Frequency           4.2 GHz
TFLOPs              17.2
Power               600 Watts
Energy efficiency   34 pJ/FLOP

Computation alone consumes 34 MW at Exascale.


Straw-man Architecture at NTV
The same processor operated at half the supply voltage:

                    Full Vdd     50% Vdd
Technology          7nm, 2018    7nm, 2018
Die area            500 mm²      500 mm²
Cores               2048         2048
Frequency           4.2 GHz      600 MHz
TFLOPs              17.2         2.5
Power               600 Watts    37 Watts
Energy efficiency   34 pJ/FLOP   15 pJ/FLOP

Reduced frequency and FLOPs, but reduced power and improved energy efficiency: compute energy efficiency close to the Exascale goal.
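The two columns are internally consistent if each core retires two FLOPs per cycle, an inference from the numbers rather than something the slide states:

```python
# Peak = cores * frequency * FLOPs/cycle; efficiency = power / peak.
CORES = 2048
FLOPS_PER_CYCLE = 2   # inferred: 2048 * 4.2 GHz * 2 = 17.2 TFLOPs

for label, freq_hz, watts in [("full Vdd", 4.2e9, 600.0), ("NTV, 50% Vdd", 0.6e9, 37.0)]:
    tflops = CORES * freq_hz * FLOPS_PER_CYCLE / 1e12
    pj_per_flop = watts / tflops          # W per TFLOP/s equals pJ per FLOP
    print(f"{label:>12}: {tflops:4.1f} TFLOPs, {pj_per_flop:4.1f} pJ/FLOP")
# Prints 17.2 TFLOPs / 34.9 pJ and 2.5 TFLOPs / 15.1 pJ, matching the table
# within rounding.
```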


Local Memory Capacity
[Figure: micro-Bytes/FLOP across the memory hierarchy (L1–L4) for a traditional system vs. the Exa straw-man at NTV.]
• Higher local memory capacity promotes data locality


Interconnect Structures

Structure                 Energy          Distance   Scalability
Shared bus                1–10 fJ/bit     0–5 mm     Limited
Multi-ported memory       10–100 fJ/bit   1–5 mm     Limited
Crossbar switch           0.1–1 pJ/bit    2–10 mm    Moderate
Packet-switched network   1–3 pJ/bit      > 5 mm     Scalable

Buses serve short distances and shared memory; the packet-switched network scales from first- and second-level switches through clusters, boards, and cabinets to the full system.


SW Challenges (programming & execution models)
1. Extreme parallelism (1,000X due to Exa, an additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move the thread to the data if necessary
4. Fine-grain resource management (driven by an objective function)
5. Applications and algorithms must incorporate the paradigm change


Programming & Execution Model
Event-driven tasks (EDT): dataflow-inspired, tiny, self-contained codelets; non-blocking, no preemption.
Programming model:
• Separation of concerns: domain specification & HW mapping
• Express data locality with hierarchical tiling
• Global, shared, non-coherent address space
• Optimization and auto-generation of EDTs (HW specific)
Execution model:
• Dynamic, event-driven, non-blocking scheduling
• Dynamic decision to move computation to data
• Observation-based adaptation (self-awareness)
• Implemented in the runtime environment
Separation of concerns: user application, control, and resource management. (A toy runtime sketch follows.)
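A toy sketch of the event-driven-task idea; the class names and scheduling policy are invented for illustration, and a real EDT runtime adds the locality-aware placement and resource management the slide lists:

```python
# Minimal event-driven task runtime: codelets run only when all their
# input events have fired; nothing blocks, nothing is preempted.
from collections import defaultdict

class Runtime:
    def __init__(self):
        self.waiting = {}                     # codelet -> set of pending events
        self.ready = []                       # codelets with no pending events
        self.subscribers = defaultdict(list)  # event -> codelets waiting on it

    def codelet(self, fn, depends_on=()):
        pending = set(depends_on)
        if pending:
            self.waiting[fn] = pending
            for ev in pending:
                self.subscribers[ev].append(fn)
        else:
            self.ready.append(fn)

    def satisfy(self, event):
        for fn in self.subscribers.pop(event, []):
            self.waiting[fn].discard(event)
            if not self.waiting[fn]:
                self.ready.append(fn)         # all inputs ready: schedule it

    def run(self):
        while self.ready:
            fn = self.ready.pop()             # run to completion, no preemption
            for ev in fn() or ():
                self.satisfy(ev)              # completion fires output events

rt = Runtime()
rt.codelet(lambda: (print("A"), ["a_done"])[1])
rt.codelet(lambda: (print("B"), ["b_done"])[1])
rt.codelet(lambda: print("C after A and B"), depends_on=["a_done", "b_done"])
rt.run()   # prints A and B (in some order), then C
```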


Over-provisioning, Introspection, Self-awareness
Addressing variations with fine-grain resource management:
• Variation leaves cores fast (F), medium (M), or slow (S)
• Provide more compute HW than needed; rely on the law of large numbers; start from a static profile
• Dynamic reconfiguration for energy efficiency, latency, and dynamic resource management
Sensors for introspection measure:
1. Energy consumption
2. Instantaneous power
3. Computations
4. Data movement
System SW then:
1. Schedules threads based on objectives and resources
2. Dynamically controls and manages resources
3. Identifies sensors and functions in HW for implementation
System SW implements the introspective execution model (a toy sketch follows).
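A cartoon of one introspective decision: choosing which over-provisioned cores to enable under a power budget. The F/M/S power and speed tables, the budget, and the greedy objective are all invented for illustration:

```python
# Enable the best throughput-per-watt cores first, until the sensed power
# budget is exhausted -- a toy objective function over an over-provisioned chip.
import random

random.seed(7)
CORE_POWER = {"F": 1.0, "M": 0.6, "S": 0.35}   # assumed relative power
CORE_SPEED = {"F": 1.0, "M": 0.7, "S": 0.4}    # assumed relative throughput

cores = [random.choice("FMS") for _ in range(64)]  # static variation profile
BUDGET = 30.0                                      # sensed power headroom

ranked = sorted(cores, key=lambda c: CORE_SPEED[c] / CORE_POWER[c], reverse=True)
enabled, power, speed = [], 0.0, 0.0
for c in ranked:
    if power + CORE_POWER[c] <= BUDGET:
        enabled.append(c)
        power += CORE_POWER[c]
        speed += CORE_SPEED[c]

print(f"enabled {len(enabled)}/{len(cores)} cores, "
      f"throughput {speed:.1f}, power {power:.1f}/{BUDGET}")
```

In the slide's scheme this decision would be revisited continuously from the on-die sensors, not just a static profile.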


Over-provisioned, Introspectively Resource-Managed System
Over-provisioned in design, dynamically tuned for the given objective.
[Figure: projected system power across 45nm–5nm, with and without data movement, against the 20 MW goal; and data-movement power by level (die, clusters, boards, cabinet, islands, system) with 40 Gbps photonic links @ 10 pJ/b, totaling 3.23 MW.]


Summary
• The power & energy challenge continues
• Opportunistically employ NTV operation
• 3D integration for DRAM
• Communication energy will far exceed computation; data locality will be paramount
• A revolutionary software stack is needed to make Exascale real
