Cache Usage Tutorial - MGNet
Part I: Architectures and Fundamentals
Prof. Craig C. Douglas, University of Kentucky and Yale University
HiPC2003, 12/17/2003, HPC Cache Aware Methods
Architectures and fundamentals
Why worry about performance? An illustrative example
Fundamentals of computer architecture
• CPUs, pipelines, superscalarity
• Memory hierarchy
Basic efficiency guidelines
Profiling
How fast should a solver be? (just a simple check with theory)
A Poisson problem can be solved by a multigrid method in < 30 operations per unknown (known since the late 70's)
More general elliptic equations may need O(100) operations per unknown
A modern CPU can do 1-6 GFLOPS
So we should be solving 10-60 million unknowns per second
Should need O(100) Mbytes of memory
How fast are solvers today?
Often no more than 10,000 to 100,000 unknowns possible before the code breaks
In a time of minutes to hours
Needing horrendous amounts of memory
Even state-of-the-art codes are often very inefficient
Comparison of solvers (what got me started in this business ~'95)
(chart: compute time in seconds, 0.01 to 10, versus number of unknowns, 1024 to 16384, for an unstructured code, SOR, structured-grid multigrid, and optimal multigrid)
Elements of CPU architecture
Modern CPUs are
• Superscalar: they can execute more than one operation per clock cycle, typically:
o 4 integer operations per clock cycle plus
o 2 or 4 floating-point operations (multiply-add)
• Pipelined:
o Floating-point ops take O(10) clock cycles to complete
o A new set of ops can be started in each cycle
• Load-store: all operations are done on data in registers; all operands must be copied to/from memory via load and store operations
Code performance is heavily dependent on compiler (and manual) optimization
Pipelining
(figure)

Pipelining (cont'd)
(figure)
CPU trends
EPIC (similar to VLIW) (IA64)
Multi-threaded architectures (Alpha, Pentium 4 HT)
Multiple CPUs on a single chip (IBM Power 4)
Within the next decade:
• Billion-transistor CPUs (today 200 million transistors)
• Potential to build a TFLOPS on a chip (e.g., SUN graphics processors)
• But no way to move the data in and out sufficiently quickly!
Memory wall
Latency: the time for memory to respond to a read (or write) request is too long
• CPU ~ 0.5 ns (light travels 15 cm in vacuum)
• Memory ~ 50 ns
Bandwidth: the number of bytes which can be read (written) per second
• A CPU with 1 GFLOPS peak performance (now standard) needs 24 Gbyte/sec bandwidth
• Present CPUs have peak bandwidth
Memory acceleration techniques
Interleaving (independent memory banks store consecutive cells of the address space cyclically)
• Improves bandwidth
• But not latency
Caches (small but fast memory) holding frequently used copies of the main memory
• Improve latency and bandwidth
• Usually come with 2 or 3 levels nowadays
• But only work when access to memory is local
Principles of locality
Temporal locality: an item referenced now will be referenced again soon
Spatial locality: an item referenced now indicates that its neighbors will be referenced soon
Cache lines are typically 32-128 bytes, with 1024 being the longest recently. Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem designs.
Caches
Fast but small extra memory
Holding identical copies of main memory
Lower latency
Higher bandwidth
Usually several levels (2, 3, or 4)
Same principle as virtual memory
Memory requests are satisfied from
• The fast cache (if it holds the appropriate copy): Cache Hit
• The slow main memory (if the data is not in cache): Cache Miss
Typical cache configuration
(figure)
Cache issues
Uniqueness and transparency of the cache
Finding the working set (what data is kept in cache)
Data consistency with main memory
Latency: the time for memory to respond to a read (or write) request
Bandwidth: the number of bytes that can be read (written) per second
Cache issues (cont'd)
Cache line size
• Prefetching effect
• False sharing (cf. associativity issues)
Replacement strategy
• Least Recently Used (LRU)
• Least Frequently Used (LFU)
• Random
Translation lookaside buffer (TLB)
• Stores virtual memory page translation entries
• Has an effect similar to another level of cache
• TLB misses are very expensive
Effect of cache hit ratio
Cache efficiency is characterized by the cache hit ratio h; the effective time for a data access is
t_eff = h * t_cache + (1 - h) * t_mem
The speedup over uncached access is then given by
S = t_mem / t_eff = 1 / (h * (t_cache / t_mem) + (1 - h))
Cache effectiveness depends on the hit ratio
Hit ratios of 90% and better are needed for good speedups
Cache organization
Number of cache levels
Set associativity
Physical or virtual addressing
Write-through/write-back policy
Replacement strategy (e.g., Random/LRU)
Cache line size
Cache associativity
Direct mapped (associativity = 1)
o Each cache block can be stored in exactly one cache line of the cache memory
Fully associative
o A cache block can be stored in any cache line
Set-associative (associativity = k)
o Each cache block can be stored in one of k places in the cache
Direct mapped and set-associative caches give rise to conflict misses.
Direct mapped caches are faster; fully associative caches are too expensive and slow (if reasonably large).
Set-associative caches are a compromise.
Typical architectures
IBM Power 3:
• L1 = 64 KB, 128-way set associative (funny definition, however)
• L2 = 4 MB, direct mapped, line size = 128, write back
IBM Power 4 (2 CPUs/chip):
• L1 = 32 KB, 2-way, line size = 128
• L2 = 1.5 MB, 8-way, line size = 128
• L3 = 32 MB, 8-way, line size = 512
Compaq EV6 (Alpha 21264):
• L1 = 64 KB, 2-way associative, line size = 32
• L2 = 4 MB (or larger), direct mapped, line size = 64
HP PA-RISC:
• PA8500, PA8600: L1 = 1.5 MB; PA8700: L1 = 2.25 MB
• No L2 cache!
Typical architectures (cont'd)
AMD Athlon (from "Thunderbird" on):
• L1 = 64 KB, L2 = 256 KB
Intel Pentium 4:
• L1 = 8 KB, 4-way, line size = 64
• L2 = 256 KB up to 2 MB, 8-way, line size = 128
Intel Itanium:
• L1 = 16 KB, 4-way
• L2 = 96 KB, 6-way
• L3: off-chip, size varies
Intel Itanium 2 (McKinley / Madison):
• L1 = 16 / 32 KB
• L2 = 256 / 256 KB
• L3: 1.5 or 3 / 6 MB
Basic efficiency guidelines
Choose the best algorithm
Use efficient libraries
Find good compiler options
Use suitable data layouts
Choose the best algorithm
Example: solution of the linear systems arising from the discretization of a special PDE
Gaussian elimination (standard): n^3/3 ops
Banded Gaussian elimination: 2n^2 ops
SOR method: 10n^1.5 ops
Multigrid method: 30n ops
Choose the best algorithm (cont'd)
For n large, the multigrid method will always outperform the others, even if it is badly implemented
Frequently, however, two methods have approximately the same complexity, and then the better implemented one will win
Use efficient libraries
Good libraries often outperform your own software
• Clever, sophisticated algorithms
• Optimized for the target machine
• Machine-specific implementation
Sources for libraries
Vendor-independent
• Commercial: NAG, IMSL, etc.; only available as binaries, often optimized for a specific platform
• Free codes: e.g., NETLIB (LAPACK, ODEPACK, …), usually as source code, not specifically optimized
Vendor-specific: e.g., cxml for HP Alpha with highly tuned LAPACK routines
Sources for libraries (cont'd)
Many libraries are quasi-standards
• BLAS
• LAPACK
• etc.
Parallel libraries for supercomputers
Specialists can sometimes outperform vendor-specific libraries
Find good compiler options
Modern compilers have numerous flags to select individual optimization options
• -On: successively more aggressive optimizations, n=1,...,8
• -fast: may change round-off behavior
• -unroll
• -arch
• etc.
Learning about your compiler is usually worth it: RTFM (which may be hundreds of pages long).
Find good compiler options (cont'd)
Hints:
Read man cc (man f77) or cc -help (or whatever causes the possible options to print)
Look up the compiler options documented at www.specbench.org for specific platforms
Experiment and compare performance on your own codes
Use suitable data layout
Access memory in order! In C/C++, for a 2D matrix
double a[n][m];
the loops should be nested such that
for (i...)
   for (j...)
      a[i][j]...
In FORTRAN, it must be the other way round
Apply loop interchange if necessary (see below)
Use suitable data layout (cont'd)
Another example: array merging
Three vectors accessed together (in C/C++):
double a[n], b[n], c[n];
can often be handled more efficiently by using
double abc[n][3];
In FORTRAN, the indices are again permuted
Profiling
Subroutine-level profiling
• The compiler inserts timing calls at the beginning and end of each subroutine
• Only suitable for coarse code analysis
• Profiling overhead can be significant
• E.g., prof, gprof
Profiling (cont'd)
Tick-based profiling
• The OS interrupts code execution regularly
• The profiling tool monitors code locations
• More detailed code analysis is possible
• Profiling overhead can still be significant
Profiling using hardware performance monitors
• Most popular approach
• Will therefore be discussed next in more detail
Profiling: hardware performance counters
Dedicated CPU registers are used to count various events at runtime:
• Data cache misses (for different levels)
• Instruction cache misses
• TLB misses
• Branch mispredictions
• Floating-point and/or integer operations
• Load/store instructions
• Etc.
Profiling tools: DCPI
DCPI = Digital Continuous Profiling Infrastructure (still supported by the current owner, HP; the source is even available)
Only for Alpha-based machines running Tru64 UNIX
Code execution is watched by a profiling daemon
Can only be used from outside the code
http://www.tru64unix.compaq.com/dcpi
Profiling tools: valgrind
Memory/thread debugger and cache profiler (4 tools). Part of the KDE project; free.
Run using
cachegrind program
Not an intrusive library; uses the hardware capabilities of CPUs.
Simple to use (even for automatic testing).
Julian Seward et al.
http://valgrind.kde.org
Profiling tools: PCL
PCL = Performance Counter Library
R. Berrendorf et al., FZ Juelich, Germany
Available for many platforms (portability!)
Usable from outside and from inside the code (library calls; C, C++, Fortran, and Java interfaces)
http://www.fz-juelich.de/zam/PCL
Profiling tools: PAPI
PAPI = Performance API
Available for many platforms (portability!)
Two interfaces:
• A high-level interface for simple measurements
• A fully programmable low-level interface, based on thread-safe groups of hardware events (EventSets)
http://icl.cs.utk.edu/projects/papi
Profiling tools: HPCToolkit
High-level portable tools for performance measurements and comparisons
• Uses a browser interface
• PAPI should have looked like this
• Make a change, check what happens on several architectures at once
http://www.hipersoft.rice.edu/hpctoolkit
Rob Fowler et al., Rice University, USA
hpcview screen shot
(figure)
HPCToolkit Philosophy 1
Intuitive, top-down user interface for performance analysis
• Machine-independent tools and GUI
o Statistics-to-XML converters
• Language independence
o Need a good symbol locator at run time
• Eliminate invasive instrumentation
• Cross-platform comparisons
HPCToolkit Philosophy 2
Provide the information needed for analysis and tuning
• Multilanguage applications
• Multiple metrics
o Must compare metrics which are causes versus effects (examples: misses, flops, loads, mispredicts, cycles, stall cycles, etc.)
• Hide getting details from the user as much as possible
HPCToolkit Philosophy 3
Eliminate manual labor from the analyze, tune, run cycle
• Collect multiple data automatically
• Eliminate the 90-10 rule
o 90% of cycles in 10% of code… for a 500K-line code, the hotspot is still 5,000 lines of code. How do you deal with a 5K-line hotspot?
• Drive the process with simple scripts
Our reference code
2D structured multigrid code written in C
Double precision floating-point arithmetic
5-point stencils
Red/black Gauss-Seidel smoother
Full weighting, linear interpolation
Direct solver on the coarsest grid (LU, LAPACK)
Structured grid
(figure)
Using PCL - Example 1
Digital PWS 500au
• Alpha 21164, 500 MHz, 1000 MFLOPS peak
• 3 on-chip performance counters
PCL hardware performance monitor: hpm
% hpm --events PCL_CYCLES,PCL_MFLOPS ./mg
hpm: elapsed time: 5.172 s
hpm: counter 0 : 2564941490 PCL_CYCLES
hpm: counter 1 : 19.635955 PCL_MFLOPS
Using PCL - Example 2
#include <pcl.h>

int main(int argc, char **argv) {
   /* Initialization */
   PCL_CNT_TYPE i_result[2];
   PCL_FP_CNT_TYPE fp_result[2];
   int counter_list[]= {PCL_FP_INSTR, PCL_MFLOPS}, res;
   unsigned int flags= PCL_MODE_USER;
   PCL_DESCR_TYPE descr;
Using PCL - Example 2 (cont'd)
   PCLinit(&descr);
   if (PCLquery(descr, counter_list, 2, flags) != PCL_SUCCESS) {
      /* Issue error message ... */
   } else {
      PCLstart(descr, counter_list, 2, flags);
      /* Do computational work here ... */
      PCLstop(descr, i_result, fp_result, 2);
      printf("%i fp instructions, MFLOPS: %f\n",
             i_result[0], fp_result[1]);
   }
   PCLexit(descr);
   return 0;
}
Using DCPI
Alpha-based machines running Tru64 UNIX
How to proceed when using DCPI:
1. Start the DCPI daemon (dcpid)
2. Run your code
3. Stop the DCPI daemon
4. Use the DCPI tools to analyze the profiling data
Examples of DCPI tools
dcpiwhatcg: Where have all the cycles gone?
dcpiprof: Breakdown of CPU time by procedures
dcpilist: Code listing (source/assembler) annotated with profiling data
dcpitopstalls: Ranking of instructions causing stall cycles
Using DCPI - Example 1
% dcpiprof ./mg
Column  Total  Period (for events)
------  -----  ------
dmiss   45745  4096
===================================================
 dmiss      %    cum%  procedure      image
 33320  72.84%  72.84%  mgSmooth       ./mg
 10008  21.88%  94.72%  mgRestriction  ./mg
  2411   5.27%  99.99%  mgProlongCorr  ./mg
[...]
Using DCPI - Example 2
Call the DCPI analysis tool:
% dcpiwhatcg ./mg
Dynamic stalls are listed first:
I-cache (not ITB)   0.1% to  7.4%
ITB/I-cache miss    0.0% to  0.0%
D-cache miss       24.2% to 27.6%
DTB miss           53.3% to 57.7%
Write buffer        0.0% to  0.3%
Synchronization     0.0% to  0.0%
Using DCPI - Example 2 (cont'd)
Branch mispredict   0.0% to  0.0%
IMUL busy           0.0% to  0.0%
FDIV busy           0.0% to  0.5%
Other               0.0% to  0.0%
Unexplained stall   0.4% to  0.4%
Unexplained gain   -0.7% to -0.7%
---------------------------------------
Subtotal dynamic   85.1%
Using DCPI - Example 2 (cont'd)
Static stalls are listed next:
Slotting         0.5%
Ra dependency    3.0%
Rb dependency    1.6%
Rc dependency    0.0%
FU dependency    0.5%
-----------------------------------------------
Subtotal static  5.6%
-----------------------------------------------
Total stall     90.7%
Using DCPI - Example 2 (cont'd)
Useful cycles are listed at the end:
Useful           7.9%
Nops             1.3%
-----------------------------------------------
Total execution  9.3%
Compare to the total percentage of stall cycles: 90.7% (cf. previous slide)
Part II: Optimization Techniques for Structured Grids
How to make codes fast
1. Use a fast algorithm (e.g., multigrid)
   I. It does not make sense to optimize a bad algorithm
   II. However, sometimes a fairly simple algorithm that is well implemented will beat a very sophisticated, super method that is poorly programmed
2. Use good coding practices
3. Use good data structures
4. Apply appropriate optimization techniques
Optimization of Floating-Point Operations
Optimization of FP operations
Loop unrolling
Fused Multiply-Add (FMA) instructions
Exposing instruction-level parallelism (ILP)
Software pipelining (again: exploit ILP)
Aliasing
Special functions
Eliminating overheads
• if statements
• Loop overhead
• Subroutine calling overhead
Loop unrolling
The simplest effect of loop unrolling: fewer test/jump instructions (a fatter loop body, less loop overhead)
Fewer loads per flop
May lead to threaded code that uses multiple FP units concurrently (instruction-level parallelism)
How are loops handled whose trip count is not a multiple of the unrolling factor?
Very long loops may not benefit from unrolling (instruction cache capacity!)
Very short loops may suffer from unrolling or benefit strongly
Loop unrolling: making fatter loop bodies
Example: DAXPY operation
do i= 1,N
   a(i)= a(i)+b(i)*c
enddo
Unrolled by 4 (assuming N is a multiple of 4):
do i= 1,N,4
   a(i)= a(i)+b(i)*c
   a(i+1)= a(i+1)+b(i+1)*c
   a(i+2)= a(i+2)+b(i+2)*c
   a(i+3)= a(i+3)+b(i+3)*c
enddo
A preconditioning loop handles the cases when N is not a multiple of 4:
ii= mod(N,4)
do i= 1,ii
   a(i)= a(i)+b(i)*c
enddo
do i= 1+ii,N,4
   a(i)= a(i)+b(i)*c
   a(i+1)= a(i+1)+b(i+1)*c
   a(i+2)= a(i+2)+b(i+2)*c
   a(i+3)= a(i+3)+b(i+3)*c
enddo
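The same unrolled DAXPY written as a C sketch, since the reference code is in C; the function name is ours, not from the slides.

```c
#include <stddef.h>

/* Unrolled DAXPY: a(i) = a(i) + b(i)*c, unrolled by 4.
   The first loop is the "preconditioning" loop from the slide: it
   handles the n % 4 leftover elements so the unrolled loop can
   always take full steps of 4. */
void daxpy_unroll4(size_t n, double *a, const double *b, double c) {
    size_t r = n % 4;
    for (size_t i = 0; i < r; i++)        /* preconditioning loop */
        a[i] += b[i] * c;
    for (size_t i = r; i < n; i += 4) {   /* fat loop body, 1 branch per 4 updates */
        a[i]     += b[i]     * c;
        a[i + 1] += b[i + 1] * c;
        a[i + 2] += b[i + 2] * c;
        a[i + 3] += b[i + 3] * c;
    }
}
```

The four statements in the body are independent of one another, which is what later lets the CPU overlap them in its pipelines.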
Loop unrolling: improving the flop/load ratio
Analysis of the flop-to-load ratio often unveils another benefit of unrolling:
do i= 1,N
   do j= 1,M
      y(i)=y(i)+a(j,i)*x(j)
   enddo
enddo
Innermost loop: three loads and two flops performed; i.e., we have one load per flop
Loop unrolling: improving the flop/load ratio (cont'd)
Both loops unrolled twice:
do i= 1,N,2
   t1= 0
   t2= 0
   do j= 1,M,2
      t1= t1+a(j,i)  *x(j) +a(j+1,i)  *x(j+1)
      t2= t2+a(j,i+1)*x(j) +a(j+1,i+1)*x(j+1)
   enddo
   y(i)  = t1
   y(i+1)= t2
enddo
Innermost loop: 8 loads and 8 flops!
Exposes instruction-level parallelism
How about unrolling by 4? Watch out for register spill!
Fused Multiply-Add (FMA)
On many CPUs (e.g., IBM Power3/Power4) there is an instruction which multiplies two operands and adds the result to a third
Consider the code
a= b + c*d + f*g
versus
a= c*d + f*g + b
Can the reordering be done automatically?
Exposing ILP
program nrm1
real a(n)
tt= 0d0
do j= 1,n
   tt= tt + a(j) * a(j)
enddo
print *,tt
end

program nrm2
real a(n)
tt1= 0d0
tt2= 0d0
do j= 1,n,2
   tt1= tt1 + a(j)*a(j)
   tt2= tt2 + a(j+1)*a(j+1)
enddo
tt= tt1 + tt2
print *, tt
end
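The nrm2 idea carried over to C: two independent accumulators give the CPU two dependence chains to work on in parallel. As in the slide, the sketch assumes n is even.

```c
#include <stddef.h>

/* Sum of squares with two accumulators. The additions into t1 and t2
   do not depend on each other, so a superscalar CPU can issue them
   concurrently instead of waiting for one long chain of adds.
   Assumes n is even. */
double sumsq_ilp(size_t n, const double *a) {
    double t1 = 0.0, t2 = 0.0;
    for (size_t j = 0; j < n; j += 2) {
        t1 += a[j]     * a[j];
        t2 += a[j + 1] * a[j + 1];
    }
    /* Note: the final sum may round slightly differently than the
       single-accumulator version. */
    return t1 + t2;
}
```

This is why the next slide stresses that the transformation is more than loop unrolling, and that it can change rounding errors.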
Exposing ILP (cont'd)
Superscalar CPUs have a high degree of on-chip parallelism that should be exploited
The optimized code uses temporary variables to indicate independent instruction streams
This is more than just loop unrolling!
Can this be done automatically?
Does it change the rounding errors?
Software pipelining
Arranging instructions in groups that can be executed together in one cycle
Again, the idea is to exploit instruction-level parallelism (on-chip parallelism)
Often done by optimizing compilers, but not always successfully
Closely related to loop unrolling
Less important on out-of-order CPUs
Aliasing
Arrays (or other data) that refer to the same memory locations
Aliasing rules differ between programming languages; e.g.,
• FORTRAN forbids aliasing: unknown result
• C/C++ permit aliasing
This is one reason why FORTRAN compilers often produce faster code than C/C++ compilers do
Aliasing (cont'd)
Example:
subroutine sub(n, a, b, c, sum)
double precision sum, a(n), b(n), c(n)
sum= 0d0
do i= 1,n
   a(i)= b(i) + 2.0d0*c(i)
enddo
return
end
FORTRAN rule: two variables cannot be aliased when one or both of them are modified in the subroutine
Correct call: call sub(n,a,b,c,sum)
Incorrect call: call sub(n,a,a,c,sum)
Aliasing (cont'd)
Aliasing is legal in C/C++: the compiler must produce conservative code
More complicated aliasing is possible; e.g., a(i) with a(i+2)
C/C++ keyword restrict or compiler option -noalias
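A C99 sketch of the Fortran subroutine above using restrict, which hands the compiler the same no-aliasing promise FORTRAN gives for free; calling it with overlapping arrays would be undefined behavior, exactly like the "incorrect call" on the previous slide.

```c
#include <stddef.h>

/* C version of the slide's subroutine: a(i) = b(i) + 2*c(i).
   'restrict' (C99) asserts that a, b, c do not overlap, so the
   compiler may vectorize and reorder the loop freely instead of
   generating conservative code. */
void sub_restrict(size_t n, double *restrict a,
                  const double *restrict b, const double *restrict c) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 2.0 * c[i];
}
```

Without restrict the compiler must assume a store to a[i] could change b or c, forcing reloads on every iteration.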
Special functions
/ (divide), sqrt, exp, log, sin, cos, etc. are expensive (up to several dozen cycles)
Use math identities, e.g., log(x) + log(y) = log(x*y)
Use special libraries that
• vectorize when many of the same functions must be evaluated
• trade accuracy for speed, when appropriate
Eliminating overheads: if statements
if statements …
• Prohibit some optimizations (e.g., loop unrolling in some cases)
• Evaluating the condition expression takes time
• The CPU pipeline may be interrupted (dynamic jump prediction)
Goal: avoid if statements in the innermost loops
No generally applicable technique exists
Eliminating if statements: an example
subroutine thresh0(n,a,thresh,ic)
dimension a(n)
ic= 0
tt= 0.d0
do j= 1,n
   tt= tt + a(j) * a(j)
   if (sqrt(tt).ge.thresh) then
      ic= j
      return
   endif
enddo
return
end
Avoid the sqrt in the condition! (Square thresh instead.)
Add tt in blocks of 128, for example (without the condition), and repeat the last block when the condition is violated
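A C sketch of the blocked rewrite the slide suggests; the function name and block length are illustrative. The threshold is squared once (no sqrt per element), and the branch runs once per block instead of once per iteration, with the triggering block replayed to recover the exact index.

```c
#include <stddef.h>

/* Returns the 1-based index at which the running sum of squares of a
   first reaches thresh (i.e., sqrt(sum) >= thresh), or 0 if it never
   does. Branch-free inner loop; condition checked once per block. */
size_t thresh0_blocked(size_t n, const double *a, double thresh) {
    const double t2 = thresh * thresh;   /* square thresh instead of sqrt(tt) */
    const size_t B = 128;                /* block length, tunable */
    double tt = 0.0;
    for (size_t s = 0; s < n; s += B) {
        size_t e = (s + B < n) ? s + B : n;
        double save = tt;
        for (size_t j = s; j < e; j++)   /* no branch in here */
            tt += a[j] * a[j];
        if (tt >= t2) {                  /* replay the block to find j */
            tt = save;
            for (size_t j = s; j < e; j++) {
                tt += a[j] * a[j];
                if (tt >= t2)
                    return j + 1;
            }
        }
    }
    return 0;
}
```

At most one block is summed twice, so the extra work is bounded by B iterations regardless of n.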
Eliminating loop overheads
For starting a loop, the CPU must free certain registers: loop counter, address, etc.
This may be significant for a short loop!
Example: for n>m
do i= 1,n
   do j= 1,m
   ...
is less efficient than
do j= 1,m
   do i= 1,n
   ...
However, data access optimizations are even more important, see below
Eliminating subroutine calling overhead
Subroutines (functions) are very important for structured, modular programming
Subroutine calls are expensive (on the order of up to 100 cycles)
Passing value arguments (copying data) can be extremely expensive when used inappropriately
Passing reference arguments (as in FORTRAN) may be dangerous from the point of view of correct software
Reference arguments (as in C++) with a const declaration
Generally, in tight loops, no subroutine calls should be used
Eliminating subroutine calling overhead (cont'd)
Inlining: the inline declaration in C++ (see below), or done automatically by the compiler
Macros in C or any other language:
#define sqre(a) a*a
What can go wrong:
sqre(x+y) expands to x+y*x+y (fix with parentheses: (a)*(a))
sqre(f(x)) expands to f(x) * f(x), so f is evaluated twice
What if f has side effects?
What if f has no side effects, but the compiler cannot deduce that?
Memory Hierarchy Optimizations: Data Layout
Data layout optimizations
Array transposition to get stride-1 access
Building cache-aware data structures by array merging
Array padding
Etc.
Data layout optimizations
Stride-1 access is usually fastest for several reasons, particularly the reuse of cache line contents
Data layout for multidimensional arrays in FORTRAN: column-major order
Example: a 4x3 array A(i,j) is stored at increasing memory addresses in the order
A(1,1), A(2,1), A(3,1), A(4,1), A(1,2), A(2,2), A(3,2), A(4,2), A(1,3), A(2,3), A(3,3), A(4,3)
The data arrangement is the "transpose" of the usual matrix layout
Data layout optimizations (cont'd)
Stride-1 access: the innermost loop iterates over the first index
Either by choosing the right data layout (array transposition) or by arranging nested loops in the right order (loop interchange):
Stride-N access:
do i=1,N
   do j=1,M
      a(i,j)=a(i,j)+b(i,j)
   enddo
enddo
Stride-1 access:
do j=1,M
   do i=1,N
      a(i,j)=a(i,j)+b(i,j)
   enddo
enddo
This will usually be done by the compiler!
Data layout optimizations: stride-1 access
do i=1,N
   do j=1,M
      s(i)=s(i)+b(i,j)*c(j)
   enddo
enddo
How about loop interchange in this case?
Better: transpose the matrix b so that the inner loop gets stride 1
Data layout optimizations: cache-aware data structures
Idea: merge data which are needed together to increase spatial locality: cache lines contain several data items
Example: Gauss-Seidel iteration; determine the data items needed simultaneously:
u_i^(k+1) = a_(i,i)^(-1) * ( f_i - sum_(j<i) a_(i,j) u_j^(k+1) - sum_(j>i) a_(i,j) u_j^(k) )
Data layout optimizations: cache-aware data structures
Example (cont'd): the right-hand side and the coefficients are accessed simultaneously; reuse cache line contents by array merging to enhance spatial locality:

    typedef struct {
        double f;
        double c_N, c_E, c_S, c_W, c_C;
    } equationData;                     // Data merged in memory

    double u[N][N];                     // Solution vector
    equationData rhsAndCoeff[N][N];     // Right-hand side
                                        // and coefficients
Data layout optimizations: array padding
Idea: allocate arrays larger than necessary to
• change relative memory distances
• avoid severe cache thrashing effects
Example (FORTRAN: column-major order), replace

    double precision u(1024, 1024)

by

    double precision u(1024+pad, 1024)

How to choose pad?
Data layout optimizations: array padding
C.-W. Tseng et al. (UMD): research on cache modeling and compiler-based array padding:
• Intra-variable padding: pad within arrays ⇒ avoid self-interference misses
• Inter-variable padding: pad between different arrays ⇒ avoid cross-interference misses
Data layout optimizations: array padding
Padding in 2D; e.g., FORTRAN77:

    double precision u(0:1024+pad, 0:1024)
Memory Hierarchy Optimizations: Data Access
Loop optimizations
• Loop unrolling (see above)
• Loop interchange
• Loop fusion
• Loop split = loop fission = loop distribution
• Loop skewing
• Loop blocking
• Etc.
Data access optimizations: loop fusion
• Idea: transform successive loops into a single loop to enhance temporal locality
• Reduces cache misses and enhances cache reuse (exploit temporal locality)
• Often applicable when data sets are processed repeatedly (e.g., in the case of iterative methods)
Data access optimizations: loop fusion
Before:

    do i= 1,N
      a(i)= a(i)+b(i)
    enddo
    do i= 1,N
      a(i)= a(i)*c(i)
    enddo

a is loaded into the cache twice (if sufficiently large).

After:

    do i= 1,N
      a(i)= (a(i)+b(i))*c(i)
    enddo

a is loaded into the cache only once.
Data access optimizations: loop fusion
Example: red/black Gauss-Seidel iteration in 2D
Data access optimizations: loop fusion
Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics: column-major order):

    for it= 1 to numIter do
      // Red nodes
      for i= 1 to n-1 do
        for j= 1+(i+1)%2 to n-1 by 2 do
          relax(u(j,i))
        end for
      end for
      // Black nodes
      for i= 1 to n-1 do
        for j= 1+i%2 to n-1 by 2 do
          relax(u(j,i))
        end for
      end for
    end for

This requires two sweeps through the whole data set per single GS iteration!
Data access optimizations: loop fusion
How the fusion technique works: (illustration)
Data access optimizations: loop fusion
Code after applying the loop fusion technique:

    for it= 1 to numIter do
      // Update red nodes in first grid row
      for j= 1 to n-1 by 2 do
        relax(u(j,1))
      end for
      // Update red and black nodes in pairs
      for i= 1 to n-1 do
        for j= 1+(i+1)%2 to n-1 by 2 do
          relax(u(j,i))
          relax(u(j,i-1))
        end for
      end for
      // Update black nodes in last grid row
      for j= 2 to n-1 by 2 do
        relax(u(j,n-1))
      end for
    end for

The solution vector u passes through the cache only once instead of twice per GS iteration!
Data access optimizations: loop split
The inverse transformation of loop fusion: divide the work of one loop into two to make the loop body less complicated.
• Leverage compiler optimizations
• Enhance instruction cache utilization
Data access optimizations: loop blocking
Loop blocking = loop tiling
• Divide the data set into subsets (blocks) which are small enough to fit in cache
• Perform as much work as possible on the data in cache before moving to the next block
• This is not always easy to accomplish because of data dependencies
Data access optimizations: loop blocking
Example: 1D blocking for red/black GS; respect the data dependencies! (illustration)
Data access optimizations: loop blocking
Code after applying the 1D blocking technique; B = number of GS iterations to be blocked/combined:

    for it= 1 to numIter/B do
      // Special handling: rows 1, ..., 2B-1
      // Not shown here ...
      // Inner part of the 2D grid
      for k= 2*B to n-1 do
        for i= k to k-2*B+1 by -2 do
          for j= 1+(k+1)%2 to n-1 by 2 do
            relax(u(j,i))
            relax(u(j,i-1))
          end for
        end for
      end for
      // Special handling: rows n-2B+1, ..., n-1
      // Not shown here ...
    end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, if 2*B+2 grid rows fit in the cache simultaneously.
If the grid rows are too large, 2D blocking can be applied.
Data access optimizations: loop blocking
More complicated blocking schemes exist.
Illustration: 2D square blocking
Data access optimizations: loop blocking
Illustration: 2D skewed blocking
Two common multigrid algorithms
V-cycle to solve A4 u4 = f4:
    Smooth A4 u4 = f4. Set f3 = R3 r4.
    Smooth A3 u3 = f3. Set f2 = R2 r3.
    Smooth A2 u2 = f2. Set f1 = R1 r2.
    Solve A1 u1 = f1 directly.
    Set u2 = u2 + I1 u1. Smooth A2 u2 = f2.
    Set u3 = u3 + I2 u2. Smooth A3 u3 = f3.
    Set u4 = u4 + I3 u3. Smooth A4 u4 = f4.
W-cycle (illustration)
Cache-optimized multigrid: the DiMEPACK library
• DFG project DiME: data-local iterative methods
• Fast algorithm + fast implementation
• Correction scheme: V-cycles, FMG
• Rectangular domains
• Constant 5-/9-point stencils
• Dirichlet/Neumann boundary conditions
DiMEPACK library
• C++ interface, fast Fortran77 subroutines
• Direct solution of the problems on the coarsest grid (LAPACK: LU, Cholesky)
• Single/double precision floating-point arithmetic
• Various array padding heuristics (Tseng)
http://www10.informatik.uni-erlangen.de/dime
V(2,2) cycle - bottom line

    Mflops | For what
    -------|---------------------------------------------------------
    13     | Standard 5-pt. operator
    56     | Cache optimized (loop orderings, data merging, simple blocking)
    150    | Constant coeff. + skewed blocking + padding
    220    | Eliminating rhs if 0 everywhere but boundary
Example: Cache-Optimized Multigrid on Regular Grids in 3D
Data layout optimizations for 3D multigrid
Array padding
Data layout optimizations for 3D multigrid
Standard padding in 3D; e.g., FORTRAN77:

    double precision u(0:1024, 0:1024, 0:1024)

becomes:

    double precision u(0:1024+pad1, 0:1024+pad2, 0:1024)
Data layout optimizations for 3D multigrid
Non-standard padding in 3D:

    double precision u(0:1024+pad1, 0:1024, 0:1024)
    ...
    u(i+k*pad2, j, k)

(or use hand-made index linearization - performance effect?)
Data layout optimizations for 3D multigrid
Array merging
Data access optimizations for 3D multigrid
1-way blocking with loop interchange
Data access optimizations for 3D multigrid
2-way blocking and 3-way blocking
Data access optimizations for 3D multigrid
4-way blocking
Example: Cache Optimizations for the Lattice Boltzmann Method
Lattice Boltzmann method
• Mainly used in CFD applications
• Employs a regular grid structure (2D, 3D)
• Particle-oriented approach based on a microscopic model of the moving fluid particles
• Jacobi-like cell update pattern: a single time step of the LBM consists of a stream step and a collide step
LBM (cont'd)
Stream: read distribution functions from neighbors.
Collide: re-compute own distribution functions.
Data layout optimization
Layout 1: two separate grids (standard approach)
Layout 2: grid compression - save memory, enhance locality
Data access optimizations
Access pattern 1: 3-way blocking
3-way blocking (cont'd)
Access pattern 2: 4-way blocking
4-way blocking (cont'd)
Illustration of the combination of layout + access optimizations
Layout: separate grids, access pattern: 3-way blocking
Layout: separate grids, access pattern: 4-way blocking
Layout: grid compression, access pattern: 3-way blocking
Layout: grid compression, access pattern: 4-way blocking
Performance results
MFLOPS for 2D GS, constant coefficients, 5-pt. stencil, DEC PWS 500au, Alpha 21164 CPU, 500 MHz
Memory access behavior
• Digital PWS 500au, Alpha 21164 CPU; L1 = 8 KB, L2 = 96 KB, L3 = 4 MB
• We use DCPI to obtain the performance data
• We measure the percentage of accesses which are satisfied by each individual level of the memory hierarchy
• Comparison: standard implementation of red/black GS (efficient loop ordering) vs. 2D skewed blocking (with and without padding)
Memory access behavior
Standard implementation of red/black GS, without array padding (percent of accesses per level):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   |  4.5 | 63.6 | 32.0 |  0.0 | 0.0
    65   |  0.5 | 75.7 | 23.6 |  0.2 | 0.0
    129  | -0.2 | 76.1 |  9.3 | 14.8 | 0.0
    257  |  5.3 | 55.1 | 25.0 | 14.5 | 0.0
    513  |  3.9 | 37.7 | 45.2 | 12.4 | 0.8
    1025 |  5.1 | 27.8 | 50.0 |  9.9 | 7.2
    2049 |  4.5 | 30.3 | 45.0 | 13.0 | 7.2
Memory access behavior
2D skewed blocking without array padding, 4 iterations blocked (B = 4):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   | 27.4 | 43.4 | 29.1 |  0.1 | 0.0
    65   | 33.4 | 46.3 | 19.5 |  0.9 | 0.0
    129  | 36.9 | 42.3 | 19.1 |  1.7 | 0.0
    257  | 38.1 | 34.1 | 25.1 |  2.7 | 0.0
    513  | 38.0 | 28.3 | 27.0 |  6.7 | 0.1
    1025 | 36.9 | 24.9 | 19.7 | 17.6 | 0.9
    2049 | 36.2 | 25.5 |  0.4 | 36.9 | 0.9
Memory access behavior
2D skewed blocking with appropriate array padding, 4 iterations blocked (B = 4):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   | 28.2 | 66.4 |  5.3 |  0.0 | 0.0
    65   | 34.3 | 55.7 |  9.1 |  0.9 | 0.0
    129  | 37.5 | 51.7 |  9.0 |  1.9 | 0.0
    257  | 37.8 | 52.8 |  7.0 |  2.3 | 0.0
    513  | 38.4 | 52.7 |  6.2 |  2.4 | 0.3
    1025 | 36.7 | 54.3 |  6.1 |  2.0 | 0.9
    2049 | 35.9 | 55.2 |  6.0 |  1.9 | 0.9
Performance results (cont'd)
3D MG, F77, variable coefficients, 7-pt. stencil, Intel Pentium 4, 2.4 GHz, Intel ifc V7.0 compiler
Performance results (cont'd)
2D LBM (D2Q9), C(++), AMD Athlon XP 2400+, 2.0 GHz, Linux, gcc V3.2.1 compiler
Performance results (cont'd)
Cache behavior (left: L1, right: L2) for the previous experiment, measured with PAPI
Performance results (cont'd)
3D LBM (D3Q19), C, AMD Opteron, 1.6 GHz, Linux, gcc V3.2.2 compiler
C++-Specific Considerations
C++-specific considerations
We will (briefly) address the following issues:
• Inlining
• Virtual functions
• Expression templates
Inlining
• Macro-like code expansion: replace the function call by the body of the function to be inlined
• How to accomplish inlining: use the C++ keyword inline, or define the method within the class declaration
• In any case: the method to be inlined needs to be defined in the header file
• However: inlining is just a suggestion to the compiler!
Inlining (cont'd)
Advantages:
• Reduce function call overhead (see above)
• Leverage cross-call optimizations: optimize the code after expanding the loop body
Disadvantage:
• Size of the machine code increases (instruction cache capacity!)
Virtual functions
• Member functions may be declared to be virtual (C++ keyword virtual)
• This mechanism becomes relevant when base class pointers are used to point to instances of derived classes
• The actual member function to be called can often be determined only at runtime (polymorphism)
• Requires a virtual function table lookup (at runtime!)
• Can be very time-consuming!
Inlining virtual functions
Virtual functions are often not compatible with inlining, where function calls are replaced by function bodies at compile time.
If the type of the object can be deduced at compile time, the compiler can even inline virtual functions (at least theoretically ...).
Expression templates
• C++ technique for passing expressions as function arguments
• The expression can be inlined into the function body using (nested) C++ templates
• Avoids the use of temporaries and therefore multiple passes of the data through the memory subsystem, particularly the cache hierarchy
Example
Define a simple vector class in the beginning:

    class vector {
    private:
        int length;
        double* a;
    public:
        vector(int l);
        double component(int i) { return a[i]; }
        ...
    };
Example (cont'd)
Want to efficiently compute vector sums like

    c = a+b+d;

Efficiently implies avoiding
• the generation of temporary objects, and
• pumping the data through the memory hierarchy several times. This is actually the time-consuming part: moving data is more expensive than processing data!
Example (cont'd)
Need a wrapper class for all expressions:

    template<class A>
    class DExpr {    // double precision expression
    private:
        A wa;
    public:
        DExpr(const A& a) : wa(a) {}
        double component(int i) { return wa.component(i); }
    };
Example (cont'd)
Need an expression template class to represent sums of expressions:

    template<class A, class B>
    class DExprSum {
        A va;
        B vb;
    public:
        DExprSum(const A& a, const B& b) : va(a), vb(b) {}
        double component(int i) {
            return va.component(i) + vb.component(i);
        }
    };
Example (cont'd)
Need overloaded operator+() variants for all possible return types, for example:

    template<class A, class B>
    DExpr<DExprSum<DExpr<A>, DExpr<B> > >
    operator+(const DExpr<A>& a, const DExpr<B>& b) {
        typedef DExprSum<DExpr<A>, DExpr<B> > ExprT;
        return DExpr<ExprT>(ExprT(a,b));
    }
Example (cont'd)
The vector class must contain a member function operator=(const DExpr<A>& ea), where A is an expression template class.
Only when this member function is called does the actual computation (the vector sum) take place.
Part III
Optimization Techniques for Unstructured Grids
Optimizations for unstructured grids
• How unstructured is the grid?
• Sparse matrices and data flow analysis
• Grid processing
• Algorithm processing
• Examples
Is it really unstructured?
This is really a quasi-unstructured mesh: there is plenty of structure in most of the oceans. Coastal areas provide a real challenge.
Subgrids and patches
Motivating example
Suppose problem information for only half of the nodes fits in cache. Gauss-Seidel updates the nodes in order, which leads to poor use of cache:
• By the time node 37 is updated, information for node 1 has probably been evicted from cache.
• Each unknown must be brought into cache at each iteration.
Motivating example
Alternative:
• Divide into two connected subsets.
• Renumber.
• Update as much as possible within a subset before visiting the other.
This leads to better data reuse within cache:
• Some unknowns can be completely updated.
• Some partial residuals can be calculated.
Cache aware Gauss-Seidel
Preprocessing phase:
• Decompose each mesh into disjoint cache blocks.
• Renumber the nodes in each block.
• Find structures in the quasi-unstructured case.
• Produce the system matrices and intergrid transfer operators with the new ordering.
Gauss-Seidel phase:
• Update as much as possible within a block without referencing data from another block.
• Calculate a (partial) residual on the last update.
• Backtrack to finish updating cache block boundaries.
Preprocessing: mesh decomposition
Goals:
• Maximize the interior of each cache block.
• Minimize connections between cache blocks.
Constraint:
• The cache should be large enough to hold the part of the matrix, right-hand side, residual, and unknown associated with a cache block.
Critical parameter: usable cache size.
Such decomposition problems have been studied in depth for load balancing parallel computation.
Example of subblock membership
(Illustration: cache blocks and subblocks L1, L2, L3 identified on the subdomains Ω̂h1 and Ω̂h2, with the cache block boundary marked.)
Distance algorithms
Several constants:
• d: the degrees of freedom per vertex
• K: average number of connections per vertex
• N_Ω: number of vertices in the cache block
Three cases for complexity bounds:
• Cache boundaries connected
• Physical boundaries unknown
• Physical boundaries known
Standard Gauss-Seidel
The complexity bound C_gs in this notation is given by

    C_gs ≤ 2 d^2 N_Ω K + d.
Cache boundaries connected
(Algorithms 1 and 2: illustrations)
Cache boundaries connected
The complexities of Algorithms 1 and 2 are

    C_1 ≤ 5 N_Ω K    and    C_2 ≤ (7K+1) N_Ω.

The cost of Algorithms 1 and 2 with respect to Gauss-Seidel is at most 6 d^{-2} sweeps on the finest grid.
Physical boundaries unknown
(Algorithm 3: illustration)
Physical boundaries unknown
The complexity of Algorithm 3 is

    C_3 ≤ 9 N_Ω K.

The cost of Algorithms 1 and 3 with respect to Gauss-Seidel is at most 7 d^{-2} sweeps on the finest grid.
Physical boundaries known
(Algorithms 4 and 5: illustrations)
Physical boundaries known
The complexity of Algorithms 1, 4, and 5 is

    C_{1,4,5} ≤ (11K+1) N_Ω.

The cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 10 d^{-2} sweeps on the finest grid. This assumes that the number of physical boundary nodes is (N_Ω)^{1/2}, which is quite pessimistic.
Physical boundaries known
If we assume that there are no more than ½ N_Ω physical boundary nodes per cache block, then we get a better bound: the cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 5.5 d^{-2} sweeps on the finest grid.
Clearly, with better estimates of the realistic number of physical boundary nodes (i.e., much less than ½ N_Ω) per cache block, we can reduce this bound.
Preprocessing costs
Pessimistic bounds (never seen to date in a real problem):
• d = 1: 5.5 Gauss-Seidel sweeps
• d = 2: 1.375 Gauss-Seidel sweeps
• d = 3: 0.611 Gauss-Seidel sweeps
In-practice bounds:
• d = 1: ~1 Gauss-Seidel sweep
• d = 2: ~½ Gauss-Seidel sweep
Numerical results: Austria
Experiment: Austria - two-dimensional elasticity.
T = Cauchy stress tensor, w = displacement,
f = (1,-1)^T on Γ4,
f = (9.5-x, 4-y) if (x,y) is in the region surrounded by Γ5, and
f = 0 otherwise.

    -∇·T = f in Ω,
    ∂w/∂n = 100w on Γ1,
    ∂w/∂y = 100w on Γ2,
    ∂w/∂x = 100w on Γ3,
    ∂w/∂n = 0 everywhere else.
Numerical experiments: Bavaria
Experiment: Bavaria - coarse grid mesh.
Stationary heat equation with 7 sources and one sink (Munich Oktoberfest).
Homogeneous Dirichlet boundary conditions on the Czech border (northeast), homogeneous Neumann b.c.'s everywhere else.
Code commentary
Preprocessing steps are separate.
Computation:
• Standard algorithms implemented in as efficient a manner as known, without the cache aware or active set tricks. This is not Randy Bank's version, but much, much faster.
• The cache aware implementation uses quasi-unstructured information.
• The codes work equally well in 2D and 3D (the latter needs a larger cache than 2D) and are really general sparse matrix codes with a PDE front end.
Implementation details
One-parameter tuning: usable cache size, which is normally ~60% of the physical cache size.
Current code:
• Fortran + C (strict spaghetti coding style and fast).
• Requires help from the authors to add a new domain and coarse grid.
New code:
• C++.
• Should not require any author to be present.
• Is supposed to be available by the end of September and will probably be done earlier. Aiming for inclusion in ML (see software.sandia.gov).
Books
• W. Briggs, V.E. Henson, S.F. McCormick, A Multigrid Tutorial, SIAM, 2000.
• D. Bulka, D. Mayhew, Efficient C++, Addison-Wesley, 2000.
• C.C. Douglas, G. Haase, U. Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization, SIAM, 2003.
• S. Goedecker, A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.
• J. Handy, The Cache Memory Book, Academic Press, 1998.
• J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, 1996.
• U. Trottenberg, C. Oosterlee, A. Schüller, Multigrid, Academic Press, 2000.
Journal articles
• C.C. Douglas, Caching in with multigrid algorithms: problems in two dimensions, Parallel Algorithms and Applications, 9 (1996), pp. 195-204.
• C.C. Douglas, G. Haase, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Portable memory hierarchy techniques for PDE solvers, Part I, SIAM News 33/5 (2000), pp. 1, 8-9; Part II, SIAM News 33/6 (2000), pp. 1, 10-11, 16.
• C.C. Douglas, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Cache optimization for structured and unstructured grid multigrid, Electronic Transactions on Numerical Analysis, 10 (2000), pp. 25-40.
• C. Weiss, M. Kowarschik, U. Rüde, W. Karl, Cache-aware multigrid methods for solving Poisson's equation in two dimensions, Computing, 64 (2000), pp. 381-399.
Conference proceedings
• M. Kowarschik, C. Weiss, DiMEPACK - a cache-optimized multigrid library, Proc. of the Intl. Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), vol. I, June 2001, pp. 425-430.
• U. Rüde, Iterative algorithms on high performance architectures, Proc. of the EuroPar97 Conference, Lecture Notes in Computer Science, Springer, Berlin, 1997, pp. 57-71.
• U. Rüde, Technological trends and their impact on the future of supercomputing, in H.-J. Bungartz, F. Durst, C. Zenger (eds.), High Performance Scientific and Engineering Computing, Lecture Notes in Computer Science, Vol. 8, Springer, 1998, pp. 459-471.
Conference proceedings
• C. Weiss, W. Karl, M. Kowarschik, U. Rüde, Memory characteristics of iterative methods, Proc. of the Supercomputing Conference, Portland, Oregon, November 1999.
Related websites
http://www.mgnet.org
http://www.mgnet.org/~douglas/ccd-preprints.html
http://www.mgnet.org/~douglas/ml-dddas.html
http://www10.informatik.uni-erlangen.de/dime
http://valgrind.kde.org
http://www.fz-juelich.de/zam/PCL
http://icl.cs.utk.edu/projects/papi
http://www.hipersoft.rice.edu/hpctoolkit
http://www.tru64unix.compaq.com/dcpi
http://ec-securehost.com/SIAM/SE16.html