Cache Usage Tutorial - MGNet
Part I: Architectures and Fundamentals
Prof. Craig C. Douglas, University of Kentucky and Yale University
HiPC2003, 12/17/2003, HPC Cache Aware Methods
Architectures and fundamentals
Why worry about performance? An illustrative example
Fundamentals of computer architecture
• CPUs, pipelines, superscalarity
• Memory hierarchy
Basic efficiency guidelines
Profiling
How fast should a solver be? (just a simple check with theory)
A Poisson problem can be solved by a multigrid method in < 30 operations per unknown (known since the late 70's)
More general elliptic equations may need O(100) operations per unknown
A modern CPU can do 1-6 GFLOPS
So we should be solving 10-60 million unknowns per second
Should need O(100) Mbytes of memory
How fast are solvers today?
Often no more than 10,000 to 100,000 unknowns possible before the code breaks
In a time of minutes to hours
Needing horrendous amounts of memory
Even state-of-the-art codes are often very inefficient
Comparison of solvers (what got me started in this business ~'95)
(chart: compute time in seconds, 0.01 to 10, versus number of unknowns, 1024 to 16384, for an unstructured code, SOR, structured-grid multigrid, and optimal multigrid)
Elements of CPU architecture
Modern CPUs are
• Superscalar: they can execute more than one operation per clock cycle, typically:
o 4 integer operations per clock cycle plus
o 2 or 4 floating-point operations (multiply-add)
• Pipelined:
o Floating-point ops take O(10) clock cycles to complete
o A new set of ops can be started in each cycle
• Load-store: all operations are done on data in registers; all operands must be copied to/from memory via load and store operations
Code performance is heavily dependent on compiler (and manual) optimization
Pipelining
(figure)

Pipelining (cont'd)
(figure)
CPU trends
EPIC (similar to VLIW) (IA64)
Multi-threaded architectures (Alpha, Pentium 4 HT)
Multiple CPUs on a single chip (IBM Power 4)
Within the next decade:
• Billion-transistor CPUs (today 200 million transistors)
• Potential to build a TFLOPS on a chip (e.g., SUN graphics processors)
• But no way to move the data in and out sufficiently quickly!
Memory wall
Latency: the time for memory to respond to a read (or write) request is too long
• CPU ~ 0.5 ns (light travels 15 cm in vacuum)
• Memory ~ 50 ns
Bandwidth: the number of bytes which can be read (written) per second
• A CPU with 1 GFLOPS peak performance (now standard) needs 24 Gbyte/sec bandwidth
• Present CPUs have peak bandwidth
Memory acceleration techniques
Interleaving (independent memory banks store consecutive cells of the address space cyclically)
• Improves bandwidth
• But not latency
Caches (small but fast memory) holding frequently used copies of the main memory
• Improve latency and bandwidth
• Usually come with 2 or 3 levels nowadays
• But only work when access to memory is local
Principles of locality
Temporal locality: an item referenced now will be referenced again soon
Spatial locality: an item referenced now indicates that its neighbors will be referenced soon
Cache lines are typically 32-128 bytes, with 1024 being the longest recently. Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem designs.
Caches
Fast but small extra memory
Holding identical copies of main memory
Lower latency
Higher bandwidth
Usually several levels (2, 3, or 4)
Same principle as virtual memory
Memory requests are satisfied from
• The fast cache (if it holds the appropriate copy): Cache Hit
• The slow main memory (if the data is not in cache): Cache Miss
Typical cache configuration
(figure)
Cache issues
Uniqueness and transparency of the cache
Finding the working set (what data is kept in cache)
Data consistency with main memory
Latency: the time for memory to respond to a read (or write) request
Bandwidth: the number of bytes that can be read (written) per second
Cache issues (cont'd)
Cache line size
• Prefetching effect
• False sharing (cf. associativity issues)
Replacement strategy
• Least Recently Used (LRU)
• Least Frequently Used (LFU)
• Random
Translation lookaside buffer (TLB)
• Stores virtual memory page translation entries
• Has an effect similar to another level of cache
• TLB misses are very expensive
Effect of cache hit ratio
Cache efficiency is characterized by the cache hit ratio h; the effective time for a data access is
t_eff = h * t_cache + (1 - h) * t_mem
The speedup over uncached access is then given by
S = t_mem / t_eff = 1 / (h * (t_cache / t_mem) + (1 - h))
Cache effectiveness depends on the hit ratio
Hit ratios of 90% and better are needed for good speedups
Cache organization
Number of cache levels
Set associativity
Physical or virtual addressing
Write-through/write-back policy
Replacement strategy (e.g., Random/LRU)
Cache line size
Cache associativity
Direct mapped (associativity = 1)
o Each cache block can be stored in exactly one cache line of the cache memory
Fully associative
o A cache block can be stored in any cache line
Set-associative (associativity = k)
o Each cache block can be stored in one of k places in the cache
Direct mapped and set-associative caches give rise to conflict misses.
Direct mapped caches are faster; fully associative caches are too expensive and slow (if reasonably large).
Set-associative caches are a compromise.
Typical architectures
IBM Power 3:
• L1 = 64 KB, 128-way set associative (funny definition, however)
• L2 = 4 MB, direct mapped, line size = 128, write back
IBM Power 4 (2 CPUs/chip):
• L1 = 32 KB, 2-way, line size = 128
• L2 = 1.5 MB, 8-way, line size = 128
• L3 = 32 MB, 8-way, line size = 512
Compaq EV6 (Alpha 21264):
• L1 = 64 KB, 2-way associative, line size = 32
• L2 = 4 MB (or larger), direct mapped, line size = 64
HP PA-RISC:
• PA8500, PA8600: L1 = 1.5 MB; PA8700: L1 = 2.25 MB
• No L2 cache!
Typical architectures (cont'd)
AMD Athlon (from "Thunderbird" on):
• L1 = 64 KB, L2 = 256 KB
Intel Pentium 4:
• L1 = 8 KB, 4-way, line size = 64
• L2 = 256 KB up to 2 MB, 8-way, line size = 128
Intel Itanium:
• L1 = 16 KB, 4-way
• L2 = 96 KB, 6-way
• L3: off-chip, size varies
Intel Itanium 2 (McKinley / Madison):
• L1 = 16 / 32 KB
• L2 = 256 / 256 KB
• L3: 1.5 or 3 / 6 MB
Basic efficiency guidelines
Choose the best algorithm
Use efficient libraries
Find good compiler options
Use suitable data layouts
Choose the best algorithm
Example: solution of the linear systems arising from the discretization of a special PDE
Gaussian elimination (standard): n^3/3 ops
Banded Gaussian elimination: 2n^2 ops
SOR method: 10n^1.5 ops
Multigrid method: 30n ops
Choose the best algorithm (cont'd)
For n large, the multigrid method will always outperform the others, even if it is badly implemented
Frequently, however, two methods have approximately the same complexity, and then the better implemented one will win
Use efficient libraries
Good libraries often outperform your own software
• Clever, sophisticated algorithms
• Optimized for the target machine
• Machine-specific implementation
Sources for libraries
Vendor-independent
• Commercial: NAG, IMSL, etc.; only available as binaries, often optimized for a specific platform
• Free codes: e.g., NETLIB (LAPACK, ODEPACK, …), usually as source code, not specifically optimized
Vendor-specific: e.g., cxml for HP Alpha with highly tuned LAPACK routines
Sources for libraries (cont'd)
Many libraries are quasi-standards
• BLAS
• LAPACK
• etc.
Parallel libraries for supercomputers
Specialists can sometimes outperform vendor-specific libraries
Find good compiler options
Modern compilers have numerous flags to select individual optimization options
• -On: successively more aggressive optimizations, n=1,...,8
• -fast: may change round-off behavior
• -unroll
• -arch
• etc.
Learning about your compiler is usually worth it: RTFM (which may be hundreds of pages long).
Find good compiler options (cont'd)
Hints:
Read man cc (man f77) or cc -help (or whatever causes the possible options to print)
Look up the compiler options documented at www.specbench.org for specific platforms
Experiment and compare performance on your own codes
Use suitable data layout
Access memory in order! In C/C++, for a 2D matrix
double a[n][m];
the loops should be nested such that
for (i...)
   for (j...)
      a[i][j]...
In FORTRAN, it must be the other way round
Apply loop interchange if necessary (see below)
Use suitable data layout (cont'd)
Another example: array merging
Three vectors accessed together (in C/C++):
double a[n], b[n], c[n];
can often be handled more efficiently by using
double abc[n][3];
In FORTRAN, the indices are again permuted
Profiling
Subroutine-level profiling
• The compiler inserts timing calls at the beginning and end of each subroutine
• Only suitable for coarse code analysis
• Profiling overhead can be significant
• E.g., prof, gprof
Profiling (cont'd)
Tick-based profiling
• The OS interrupts code execution regularly
• The profiling tool monitors code locations
• More detailed code analysis is possible
• Profiling overhead can still be significant
Profiling using hardware performance monitors
• Most popular approach
• Will therefore be discussed next in more detail
Profiling: hardware performance counters
Dedicated CPU registers are used to count various events at runtime:
• Data cache misses (for different levels)
• Instruction cache misses
• TLB misses
• Branch mispredictions
• Floating-point and/or integer operations
• Load/store instructions
• Etc.
Profiling tools: DCPI
DCPI = Digital Continuous Profiling Infrastructure (still supported by the current owner, HP; the source is even available)
Only for Alpha-based machines running Tru64 UNIX
Code execution is watched by a profiling daemon
Can only be used from outside the code
http://www.tru64unix.compaq.com/dcpi
Profiling tools: valgrind
Memory/thread debugger and cache profiler (4 tools). Part of the KDE project; free.
Run using
cachegrind program
Not an intrusive library; uses the hardware capabilities of CPUs.
Simple to use (even for automatic testing).
Julian Seward et al.
http://valgrind.kde.org
Profiling tools: PCL
PCL = Performance Counter Library
R. Berrendorf et al., FZ Juelich, Germany
Available for many platforms (portability!)
Usable from outside and from inside the code (library calls; C, C++, Fortran, and Java interfaces)
http://www.fz-juelich.de/zam/PCL
Profiling tools: PAPI
PAPI = Performance API
Available for many platforms (portability!)
Two interfaces:
• A high-level interface for simple measurements
• A fully programmable low-level interface, based on thread-safe groups of hardware events (EventSets)
http://icl.cs.utk.edu/projects/papi
Profiling tools: HPCToolkit
High-level portable tools for performance measurements and comparisons
• Uses a browser interface
• PAPI should have looked like this
• Make a change, check what happens on several architectures at once
http://www.hipersoft.rice.edu/hpctoolkit
Rob Fowler et al., Rice University, USA
hpcview screen shot
(figure)
HPCToolkit Philosophy 1
Intuitive, top-down user interface for performance analysis
• Machine-independent tools and GUI
o Statistics-to-XML converters
• Language independence
o Need a good symbol locator at run time
• Eliminate invasive instrumentation
• Cross-platform comparisons
HPCToolkit Philosophy 2
Provide the information needed for analysis and tuning
• Multilanguage applications
• Multiple metrics
o Must compare metrics which are causes versus effects (examples: misses, flops, loads, mispredicts, cycles, stall cycles, etc.)
• Hide getting details from the user as much as possible
HPCToolkit Philosophy 3
Eliminate manual labor from the analyze, tune, run cycle
• Collect multiple data automatically
• Eliminate the 90-10 rule
o 90% of cycles in 10% of code… for a 500K-line code, the hotspot is still 5,000 lines of code. How do you deal with a 5K-line hotspot?
• Drive the process with simple scripts
Our reference code
2D structured multigrid code written in C
Double precision floating-point arithmetic
5-point stencils
Red/black Gauss-Seidel smoother
Full weighting, linear interpolation
Direct solver on the coarsest grid (LU, LAPACK)
Structured grid
(figure)
Using PCL - Example 1
Digital PWS 500au
• Alpha 21164, 500 MHz, 1000 MFLOPS peak
• 3 on-chip performance counters
PCL hardware performance monitor: hpm
% hpm --events PCL_CYCLES,PCL_MFLOPS ./mg
hpm: elapsed time: 5.172 s
hpm: counter 0 : 2564941490 PCL_CYCLES
hpm: counter 1 : 19.635955 PCL_MFLOPS
Using PCL - Example 2
#include <pcl.h>

int main(int argc, char **argv) {
   /* Initialization */
   PCL_CNT_TYPE i_result[2];
   PCL_FP_CNT_TYPE fp_result[2];
   int counter_list[]= {PCL_FP_INSTR, PCL_MFLOPS}, res;
   unsigned int flags= PCL_MODE_USER;
   PCL_DESCR_TYPE descr;
Using PCL - Example 2 (cont'd)
   PCLinit(&descr);
   if (PCLquery(descr, counter_list, 2, flags) != PCL_SUCCESS) {
      /* Issue error message ... */
   } else {
      PCLstart(descr, counter_list, 2, flags);
      /* Do computational work here ... */
      PCLstop(descr, i_result, fp_result, 2);
      printf("%i fp instructions, MFLOPS: %f\n",
             i_result[0], fp_result[1]);
   }
   PCLexit(descr);
   return 0;
}
Using DCPI
Alpha-based machines running Tru64 UNIX
How to proceed when using DCPI:
1. Start the DCPI daemon (dcpid)
2. Run your code
3. Stop the DCPI daemon
4. Use the DCPI tools to analyze the profiling data
Examples of DCPI tools
dcpiwhatcg: Where have all the cycles gone?
dcpiprof: Breakdown of CPU time by procedures
dcpilist: Code listing (source/assembler) annotated with profiling data
dcpitopstalls: Ranking of instructions causing stall cycles
Using DCPI - Example 1
% dcpiprof ./mg
Column  Total  Period (for events)
------  -----  ------
dmiss   45745  4096
===================================================
 dmiss      %    cum%  procedure      image
 33320  72.84%  72.84%  mgSmooth       ./mg
 10008  21.88%  94.72%  mgRestriction  ./mg
  2411   5.27%  99.99%  mgProlongCorr  ./mg
[...]
Using DCPI - Example 2
Call the DCPI analysis tool:
% dcpiwhatcg ./mg
Dynamic stalls are listed first:
I-cache (not ITB)   0.1% to  7.4%
ITB/I-cache miss    0.0% to  0.0%
D-cache miss       24.2% to 27.6%
DTB miss           53.3% to 57.7%
Write buffer        0.0% to  0.3%
Synchronization     0.0% to  0.0%
Using DCPI - Example 2 (cont'd)
Branch mispredict   0.0% to  0.0%
IMUL busy           0.0% to  0.0%
FDIV busy           0.0% to  0.5%
Other               0.0% to  0.0%
Unexplained stall   0.4% to  0.4%
Unexplained gain   -0.7% to -0.7%
---------------------------------------
Subtotal dynamic   85.1%
Using DCPI - Example 2 (cont'd)
Static stalls are listed next:
Slotting         0.5%
Ra dependency    3.0%
Rb dependency    1.6%
Rc dependency    0.0%
FU dependency    0.5%
-----------------------------------------------
Subtotal static  5.6%
-----------------------------------------------
Total stall     90.7%
Using DCPI - Example 2 (cont'd)
Useful cycles are listed at the end:
Useful           7.9%
Nops             1.3%
-----------------------------------------------
Total execution  9.3%
Compare to the total percentage of stall cycles: 90.7% (cf. previous slide)
Part II: Optimization Techniques for Structured Grids
How to make codes fast
1. Use a fast algorithm (e.g., multigrid)
   I. It does not make sense to optimize a bad algorithm
   II. However, sometimes a fairly simple algorithm that is well implemented will beat a very sophisticated, super method that is poorly programmed
2. Use good coding practices
3. Use good data structures
4. Apply appropriate optimization techniques
Optimization of Floating-Point Operations
Optimization of FP operations
Loop unrolling
Fused Multiply-Add (FMA) instructions
Exposing instruction-level parallelism (ILP)
Software pipelining (again: exploit ILP)
Aliasing
Special functions
Eliminating overheads
• if statements
• Loop overhead
• Subroutine calling overhead
Loop unrolling
The simplest effect of loop unrolling: fewer test/jump instructions (a fatter loop body, less loop overhead)
Fewer loads per flop
May lead to threaded code that uses multiple FP units concurrently (instruction-level parallelism)
How are loops handled whose trip count is not a multiple of the unrolling factor?
Very long loops may not benefit from unrolling (instruction cache capacity!)
Very short loops may suffer from unrolling or benefit strongly
Loop unrolling: making fatter loop bodies
Example: DAXPY operation
do i= 1,N
   a(i)= a(i)+b(i)*c
enddo
Unrolled by 4 (assuming N is a multiple of 4):
do i= 1,N,4
   a(i)= a(i)+b(i)*c
   a(i+1)= a(i+1)+b(i+1)*c
   a(i+2)= a(i+2)+b(i+2)*c
   a(i+3)= a(i+3)+b(i+3)*c
enddo
A preconditioning loop handles the cases when N is not a multiple of 4:
ii= mod(N,4)
do i= 1,ii
   a(i)= a(i)+b(i)*c
enddo
do i= 1+ii,N,4
   a(i)= a(i)+b(i)*c
   a(i+1)= a(i+1)+b(i+1)*c
   a(i+2)= a(i+2)+b(i+2)*c
   a(i+3)= a(i+3)+b(i+3)*c
enddo
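The same unrolled DAXPY written as a C sketch, since the reference code is in C; the function name is ours, not from the slides.

```c
#include <stddef.h>

/* Unrolled DAXPY: a(i) = a(i) + b(i)*c, unrolled by 4.
   The first loop is the "preconditioning" loop from the slide: it
   handles the n % 4 leftover elements so the unrolled loop can
   always take full steps of 4. */
void daxpy_unroll4(size_t n, double *a, const double *b, double c) {
    size_t r = n % 4;
    for (size_t i = 0; i < r; i++)        /* preconditioning loop */
        a[i] += b[i] * c;
    for (size_t i = r; i < n; i += 4) {   /* fat loop body, 1 branch per 4 updates */
        a[i]     += b[i]     * c;
        a[i + 1] += b[i + 1] * c;
        a[i + 2] += b[i + 2] * c;
        a[i + 3] += b[i + 3] * c;
    }
}
```

The four statements in the body are independent of one another, which is what later lets the CPU overlap them in its pipelines.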
Loop unrolling: improving the flop/load ratio
Analysis of the flop-to-load ratio often unveils another benefit of unrolling:
do i= 1,N
   do j= 1,M
      y(i)=y(i)+a(j,i)*x(j)
   enddo
enddo
Innermost loop: three loads and two flops performed; i.e., we have one load per flop
Loop unrolling: improving the flop/load ratio (cont'd)
Both loops unrolled twice:
do i= 1,N,2
   t1= 0
   t2= 0
   do j= 1,M,2
      t1= t1+a(j,i)  *x(j) +a(j+1,i)  *x(j+1)
      t2= t2+a(j,i+1)*x(j) +a(j+1,i+1)*x(j+1)
   enddo
   y(i)  = t1
   y(i+1)= t2
enddo
Innermost loop: 8 loads and 8 flops!
Exposes instruction-level parallelism
How about unrolling by 4? Watch out for register spill!
Fused Multiply-Add (FMA)
On many CPUs (e.g., IBM Power3/Power4) there is an instruction which multiplies two operands and adds the result to a third
Consider the code
a= b + c*d + f*g
versus
a= c*d + f*g + b
Can the reordering be done automatically?
Exposing ILP
program nrm1
real a(n)
tt= 0d0
do j= 1,n
   tt= tt + a(j) * a(j)
enddo
print *,tt
end

program nrm2
real a(n)
tt1= 0d0
tt2= 0d0
do j= 1,n,2
   tt1= tt1 + a(j)*a(j)
   tt2= tt2 + a(j+1)*a(j+1)
enddo
tt= tt1 + tt2
print *, tt
end
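The nrm2 idea carried over to C: two independent accumulators give the CPU two dependence chains to work on in parallel. As in the slide, the sketch assumes n is even.

```c
#include <stddef.h>

/* Sum of squares with two accumulators. The additions into t1 and t2
   do not depend on each other, so a superscalar CPU can issue them
   concurrently instead of waiting for one long chain of adds.
   Assumes n is even. */
double sumsq_ilp(size_t n, const double *a) {
    double t1 = 0.0, t2 = 0.0;
    for (size_t j = 0; j < n; j += 2) {
        t1 += a[j]     * a[j];
        t2 += a[j + 1] * a[j + 1];
    }
    /* Note: the final sum may round slightly differently than the
       single-accumulator version. */
    return t1 + t2;
}
```

This is why the next slide stresses that the transformation is more than loop unrolling, and that it can change rounding errors.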
Exposing ILP (cont'd)
Superscalar CPUs have a high degree of on-chip parallelism that should be exploited
The optimized code uses temporary variables to indicate independent instruction streams
This is more than just loop unrolling!
Can this be done automatically?
Does it change the rounding errors?
Software pipelining
Arranging instructions in groups that can be executed together in one cycle
Again, the idea is to exploit instruction-level parallelism (on-chip parallelism)
Often done by optimizing compilers, but not always successfully
Closely related to loop unrolling
Less important on out-of-order CPUs
Aliasing
Arrays (or other data) that refer to the same memory locations
Aliasing rules differ between programming languages; e.g.,
• FORTRAN forbids aliasing: unknown result
• C/C++ permit aliasing
This is one reason why FORTRAN compilers often produce faster code than C/C++ compilers do
Aliasing (cont'd)
Example:
subroutine sub(n, a, b, c, sum)
double precision sum, a(n), b(n), c(n)
sum= 0d0
do i= 1,n
   a(i)= b(i) + 2.0d0*c(i)
enddo
return
end
FORTRAN rule: two variables cannot be aliased when one or both of them are modified in the subroutine
Correct call: call sub(n,a,b,c,sum)
Incorrect call: call sub(n,a,a,c,sum)
Aliasing (cont'd)
Aliasing is legal in C/C++: the compiler must produce conservative code
More complicated aliasing is possible; e.g., a(i) with a(i+2)
C/C++ keyword restrict or compiler option -noalias
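A C99 sketch of the Fortran subroutine above using restrict, which hands the compiler the same no-aliasing promise FORTRAN gives for free; calling it with overlapping arrays would be undefined behavior, exactly like the "incorrect call" on the previous slide.

```c
#include <stddef.h>

/* C version of the slide's subroutine: a(i) = b(i) + 2*c(i).
   'restrict' (C99) asserts that a, b, c do not overlap, so the
   compiler may vectorize and reorder the loop freely instead of
   generating conservative code. */
void sub_restrict(size_t n, double *restrict a,
                  const double *restrict b, const double *restrict c) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 2.0 * c[i];
}
```

Without restrict the compiler must assume a store to a[i] could change b or c, forcing reloads on every iteration.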
Special functions
/ (divide), sqrt, exp, log, sin, cos, etc. are expensive (up to several dozen cycles)
Use math identities, e.g., log(x) + log(y) = log(x*y)
Use special libraries that
• vectorize when many of the same functions must be evaluated
• trade accuracy for speed, when appropriate
Eliminating overheads: if statements
if statements …
• Prohibit some optimizations (e.g., loop unrolling in some cases)
• Evaluating the condition expression takes time
• The CPU pipeline may be interrupted (dynamic jump prediction)
Goal: avoid if statements in the innermost loops
No generally applicable technique exists
Eliminating if statements: an example
subroutine thresh0(n,a,thresh,ic)
dimension a(n)
ic= 0
tt= 0.d0
do j= 1,n
   tt= tt + a(j) * a(j)
   if (sqrt(tt).ge.thresh) then
      ic= j
      return
   endif
enddo
return
end
Avoid the sqrt in the condition! (Square thresh instead.)
Add tt in blocks of 128, for example (without the condition), and repeat the last block when the condition is violated
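A C sketch of the blocked rewrite the slide suggests; the function name and block length are illustrative. The threshold is squared once (no sqrt per element), and the branch runs once per block instead of once per iteration, with the triggering block replayed to recover the exact index.

```c
#include <stddef.h>

/* Returns the 1-based index at which the running sum of squares of a
   first reaches thresh (i.e., sqrt(sum) >= thresh), or 0 if it never
   does. Branch-free inner loop; condition checked once per block. */
size_t thresh0_blocked(size_t n, const double *a, double thresh) {
    const double t2 = thresh * thresh;   /* square thresh instead of sqrt(tt) */
    const size_t B = 128;                /* block length, tunable */
    double tt = 0.0;
    for (size_t s = 0; s < n; s += B) {
        size_t e = (s + B < n) ? s + B : n;
        double save = tt;
        for (size_t j = s; j < e; j++)   /* no branch in here */
            tt += a[j] * a[j];
        if (tt >= t2) {                  /* replay the block to find j */
            tt = save;
            for (size_t j = s; j < e; j++) {
                tt += a[j] * a[j];
                if (tt >= t2)
                    return j + 1;
            }
        }
    }
    return 0;
}
```

At most one block is summed twice, so the extra work is bounded by B iterations regardless of n.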
Eliminating loop overheads
For starting a loop, the CPU must free certain registers: loop counter, address, etc.
This may be significant for a short loop!
Example: for n>m
do i= 1,n
   do j= 1,m
   ...
is less efficient than
do j= 1,m
   do i= 1,n
   ...
However, data access optimizations are even more important, see below
Eliminating subroutine calling overhead
Subroutines (functions) are very important for structured, modular programming
Subroutine calls are expensive (on the order of up to 100 cycles)
Passing value arguments (copying data) can be extremely expensive when used inappropriately
Passing reference arguments (as in FORTRAN) may be dangerous from the point of view of correct software
Reference arguments (as in C++) with a const declaration
Generally, in tight loops, no subroutine calls should be used
Eliminating subroutine calling overhead (cont'd)
Inlining: the inline declaration in C++ (see below), or done automatically by the compiler
Macros in C or any other language:
#define sqre(a) a*a
What can go wrong:
sqre(x+y) expands to x+y*x+y (fix with parentheses: (a)*(a))
sqre(f(x)) expands to f(x) * f(x), so f is evaluated twice
What if f has side effects?
What if f has no side effects, but the compiler cannot deduce that?
Memory Hierarchy Optimizations: Data Layout
Data layout optimizations
Array transposition to get stride-1 access
Building cache-aware data structures by array merging
Array padding
Etc.
Data layout optimizations
Stride-1 access is usually fastest for several reasons, particularly the reuse of cache line contents
Data layout for multidimensional arrays in FORTRAN: column-major order
Example: a 4x3 array A(i,j) is stored at increasing memory addresses in the order
A(1,1), A(2,1), A(3,1), A(4,1), A(1,2), A(2,2), A(3,2), A(4,2), A(1,3), A(2,3), A(3,3), A(4,3)
The data arrangement is the "transpose" of the usual matrix layout
Data layout optimizations (cont'd)
Stride-1 access: the innermost loop iterates over the first index
Either by choosing the right data layout (array transposition) or by arranging nested loops in the right order (loop interchange):
Stride-N access:
do i=1,N
   do j=1,M
      a(i,j)=a(i,j)+b(i,j)
   enddo
enddo
Stride-1 access:
do j=1,M
   do i=1,N
      a(i,j)=a(i,j)+b(i,j)
   enddo
enddo
This will usually be done by the compiler!
Data layout optimizations: stride-1 access
do i=1,N
   do j=1,M
      s(i)=s(i)+b(i,j)*c(j)
   enddo
enddo
How about loop interchange in this case?
Better: transpose the matrix b so that the inner loop gets stride 1
Data layout optimizations: cache-aware data structures
Idea: merge data which are needed together to increase spatial locality: cache lines contain several data items
Example: Gauss-Seidel iteration; determine the data items needed simultaneously:
u_i^(k+1) = a_(i,i)^(-1) * ( f_i - sum_(j<i) a_(i,j) u_j^(k+1) - sum_(j>i) a_(i,j) u_j^(k) )
Data layout optimizations: cache-aware data structures
Example (cont'd): the right-hand side and the coefficients are accessed simultaneously; reuse cache line contents by array merging to enhance spatial locality:

    typedef struct {
        double f;
        double c_N, c_E, c_S, c_W, c_C;
    } equationData;                     // Data merged in memory

    double u[N][N];                     // Solution vector
    equationData rhsAndCoeff[N][N];     // Right-hand side
                                        // and coefficients
Data layout optimizations: array padding
Idea: allocate arrays larger than necessary to
• change relative memory distances
• avoid severe cache thrashing effects
Example (FORTRAN: column-major order), replace

    double precision u(1024, 1024)

by

    double precision u(1024+pad, 1024)

How to choose pad?
Data layout optimizations: array padding
C.-W. Tseng et al. (UMD): research on cache modeling and compiler-based array padding:
• Intra-variable padding: pad within arrays ⇒ avoid self-interference misses
• Inter-variable padding: pad between different arrays ⇒ avoid cross-interference misses
Data layout optimizations: array padding
Padding in 2D; e.g., FORTRAN77:

    double precision u(0:1024+pad, 0:1024)
Memory Hierarchy Optimizations: Data Access
Loop optimizations
• Loop unrolling (see above)
• Loop interchange
• Loop fusion
• Loop split = loop fission = loop distribution
• Loop skewing
• Loop blocking
• Etc.
Data access optimizations: loop fusion
• Idea: transform successive loops into a single loop to enhance temporal locality
• Reduces cache misses and enhances cache reuse (exploit temporal locality)
• Often applicable when data sets are processed repeatedly (e.g., in the case of iterative methods)
Data access optimizations: loop fusion
Before:

    do i= 1,N
      a(i)= a(i)+b(i)
    enddo
    do i= 1,N
      a(i)= a(i)*c(i)
    enddo

a is loaded into the cache twice (if sufficiently large).

After:

    do i= 1,N
      a(i)= (a(i)+b(i))*c(i)
    enddo

a is loaded into the cache only once.
Data access optimizations: loop fusion
Example: red/black Gauss-Seidel iteration in 2D
Data access optimizations: loop fusion
Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics: column-major order):

    for it= 1 to numIter do
      // Red nodes
      for i= 1 to n-1 do
        for j= 1+(i+1)%2 to n-1 by 2 do
          relax(u(j,i))
        end for
      end for
      // Black nodes
      for i= 1 to n-1 do
        for j= 1+i%2 to n-1 by 2 do
          relax(u(j,i))
        end for
      end for
    end for

This requires two sweeps through the whole data set per single GS iteration!
Data access optimizations: loop fusion
How the fusion technique works: (illustration)
Data access optimizations: loop fusion
Code after applying the loop fusion technique:

    for it= 1 to numIter do
      // Update red nodes in first grid row
      for j= 1 to n-1 by 2 do
        relax(u(j,1))
      end for
      // Update red and black nodes in pairs
      for i= 1 to n-1 do
        for j= 1+(i+1)%2 to n-1 by 2 do
          relax(u(j,i))
          relax(u(j,i-1))
        end for
      end for
      // Update black nodes in last grid row
      for j= 2 to n-1 by 2 do
        relax(u(j,n-1))
      end for
    end for

The solution vector u passes through the cache only once instead of twice per GS iteration!
Data access optimizations: loop split
The inverse transformation of loop fusion: divide the work of one loop into two to make the loop body less complicated.
• Leverage compiler optimizations
• Enhance instruction cache utilization
Data access optimizations: loop blocking
Loop blocking = loop tiling
• Divide the data set into subsets (blocks) which are small enough to fit in cache
• Perform as much work as possible on the data in cache before moving to the next block
• This is not always easy to accomplish because of data dependencies
Data access optimizations: loop blocking
Example: 1D blocking for red/black GS; respect the data dependencies! (illustration)
Data access optimizations: loop blocking
Code after applying the 1D blocking technique; B = number of GS iterations to be blocked/combined:

    for it= 1 to numIter/B do
      // Special handling: rows 1, ..., 2B-1
      // Not shown here ...
      // Inner part of the 2D grid
      for k= 2*B to n-1 do
        for i= k to k-2*B+1 by -2 do
          for j= 1+(k+1)%2 to n-1 by 2 do
            relax(u(j,i))
            relax(u(j,i-1))
          end for
        end for
      end for
      // Special handling: rows n-2B+1, ..., n-1
      // Not shown here ...
    end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, if 2*B+2 grid rows fit in the cache simultaneously.
If the grid rows are too large, 2D blocking can be applied.
Data access optimizations: loop blocking
More complicated blocking schemes exist.
Illustration: 2D square blocking
Data access optimizations: loop blocking
Illustration: 2D skewed blocking
Two common multigrid algorithms
V-cycle to solve A4 u4 = f4:
    Smooth A4 u4 = f4. Set f3 = R3 r4.
    Smooth A3 u3 = f3. Set f2 = R2 r3.
    Smooth A2 u2 = f2. Set f1 = R1 r2.
    Solve A1 u1 = f1 directly.
    Set u2 = u2 + I1 u1. Smooth A2 u2 = f2.
    Set u3 = u3 + I2 u2. Smooth A3 u3 = f3.
    Set u4 = u4 + I3 u3. Smooth A4 u4 = f4.
W-cycle (illustration)
Cache-optimized multigrid: the DiMEPACK library
• DFG project DiME: data-local iterative methods
• Fast algorithm + fast implementation
• Correction scheme: V-cycles, FMG
• Rectangular domains
• Constant 5-/9-point stencils
• Dirichlet/Neumann boundary conditions
DiMEPACK library
• C++ interface, fast Fortran77 subroutines
• Direct solution of the problems on the coarsest grid (LAPACK: LU, Cholesky)
• Single/double precision floating-point arithmetic
• Various array padding heuristics (Tseng)
http://www10.informatik.uni-erlangen.de/dime
V(2,2) cycle - bottom line

    Mflops | For what
    -------|---------------------------------------------------------
    13     | Standard 5-pt. operator
    56     | Cache optimized (loop orderings, data merging, simple blocking)
    150    | Constant coeff. + skewed blocking + padding
    220    | Eliminating rhs if 0 everywhere but boundary
Example: Cache-Optimized Multigrid on Regular Grids in 3D
Data layout optimizations for 3D multigrid
Array padding
Data layout optimizations for 3D multigrid
Standard padding in 3D; e.g., FORTRAN77:

    double precision u(0:1024, 0:1024, 0:1024)

becomes:

    double precision u(0:1024+pad1, 0:1024+pad2, 0:1024)
Data layout optimizations for 3D multigrid
Non-standard padding in 3D:

    double precision u(0:1024+pad1, 0:1024, 0:1024)
    ...
    u(i+k*pad2, j, k)

(or use hand-made index linearization - performance effect?)
Data layout optimizations for 3D multigrid
Array merging
Data access optimizations for 3D multigrid
1-way blocking with loop interchange
Data access optimizations for 3D multigrid
2-way blocking and 3-way blocking
Data access optimizations for 3D multigrid
4-way blocking
Example: Cache Optimizations for the Lattice Boltzmann Method
Lattice Boltzmann method
• Mainly used in CFD applications
• Employs a regular grid structure (2D, 3D)
• Particle-oriented approach based on a microscopic model of the moving fluid particles
• Jacobi-like cell update pattern: a single time step of the LBM consists of a stream step and a collide step
LBM (cont'd)
Stream: read distribution functions from neighbors.
Collide: re-compute own distribution functions.
Data layout optimization
Layout 1: two separate grids (standard approach)
Layout 2: grid compression - save memory, enhance locality
Data access optimizations
Access pattern 1: 3-way blocking
3-way blocking (cont'd)
Access pattern 2: 4-way blocking
4-way blocking (cont'd)
Illustration of the combination of layout + access optimizations
Layout: separate grids, access pattern: 3-way blocking
Layout: separate grids, access pattern: 4-way blocking
Layout: grid compression, access pattern: 3-way blocking
Layout: grid compression, access pattern: 4-way blocking
Performance results
MFLOPS for 2D GS, constant coefficients, 5-pt. stencil, DEC PWS 500au, Alpha 21164 CPU, 500 MHz
Memory access behavior
• Digital PWS 500au, Alpha 21164 CPU; L1 = 8 KB, L2 = 96 KB, L3 = 4 MB
• We use DCPI to obtain the performance data
• We measure the percentage of accesses which are satisfied by each individual level of the memory hierarchy
• Comparison: standard implementation of red/black GS (efficient loop ordering) vs. 2D skewed blocking (with and without padding)
Memory access behavior
Standard implementation of red/black GS, without array padding (percent of accesses per level):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   |  4.5 | 63.6 | 32.0 |  0.0 | 0.0
    65   |  0.5 | 75.7 | 23.6 |  0.2 | 0.0
    129  | -0.2 | 76.1 |  9.3 | 14.8 | 0.0
    257  |  5.3 | 55.1 | 25.0 | 14.5 | 0.0
    513  |  3.9 | 37.7 | 45.2 | 12.4 | 0.8
    1025 |  5.1 | 27.8 | 50.0 |  9.9 | 7.2
    2049 |  4.5 | 30.3 | 45.0 | 13.0 | 7.2
Memory access behavior
2D skewed blocking without array padding, 4 iterations blocked (B = 4):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   | 27.4 | 43.4 | 29.1 |  0.1 | 0.0
    65   | 33.4 | 46.3 | 19.5 |  0.9 | 0.0
    129  | 36.9 | 42.3 | 19.1 |  1.7 | 0.0
    257  | 38.1 | 34.1 | 25.1 |  2.7 | 0.0
    513  | 38.0 | 28.3 | 27.0 |  6.7 | 0.1
    1025 | 36.9 | 24.9 | 19.7 | 17.6 | 0.9
    2049 | 36.2 | 25.5 |  0.4 | 36.9 | 0.9
Memory access behavior
2D skewed blocking with appropriate array padding, 4 iterations blocked (B = 4):

    Size | +/-  | L1   | L2   | L3   | Mem.
    33   | 28.2 | 66.4 |  5.3 |  0.0 | 0.0
    65   | 34.3 | 55.7 |  9.1 |  0.9 | 0.0
    129  | 37.5 | 51.7 |  9.0 |  1.9 | 0.0
    257  | 37.8 | 52.8 |  7.0 |  2.3 | 0.0
    513  | 38.4 | 52.7 |  6.2 |  2.4 | 0.3
    1025 | 36.7 | 54.3 |  6.1 |  2.0 | 0.9
    2049 | 35.9 | 55.2 |  6.0 |  1.9 | 0.9
Performance results (cont'd)
3D MG, F77, variable coefficients, 7-pt. stencil, Intel Pentium 4, 2.4 GHz, Intel ifc V7.0 compiler
Performance results (cont'd)
2D LBM (D2Q9), C(++), AMD Athlon XP 2400+, 2.0 GHz, Linux, gcc V3.2.1 compiler
Performance results (cont'd)
Cache behavior (left: L1, right: L2) for the previous experiment, measured with PAPI
Performance results (cont'd)
3D LBM (D3Q19), C, AMD Opteron, 1.6 GHz, Linux, gcc V3.2.2 compiler
C++-Specific Considerations
C++-specific considerations
We will (briefly) address the following issues:
• Inlining
• Virtual functions
• Expression templates
Inlining
• Macro-like code expansion: replace the function call by the body of the function to be inlined
• How to accomplish inlining: use the C++ keyword inline, or define the method within the class declaration
• In any case: the method to be inlined needs to be defined in the header file
• However: inlining is just a suggestion to the compiler!
Inlining (cont'd)
Advantages:
• Reduce function call overhead (see above)
• Leverage cross-call optimizations: optimize the code after expanding the loop body
Disadvantage:
• Size of the machine code increases (instruction cache capacity!)
Virtual functions
• Member functions may be declared to be virtual (C++ keyword virtual)
• This mechanism becomes relevant when base class pointers are used to point to instances of derived classes
• The actual member function to be called can often be determined only at runtime (polymorphism)
• Requires a virtual function table lookup (at runtime!)
• Can be very time-consuming!
Inlining virtual functions
Virtual functions are often not compatible with inlining, where function calls are replaced by function bodies at compile time.
If the type of the object can be deduced at compile time, the compiler can even inline virtual functions (at least theoretically ...).
Expression templates
• C++ technique for passing expressions as function arguments
• The expression can be inlined into the function body using (nested) C++ templates
• Avoids the use of temporaries and therefore multiple passes of the data through the memory subsystem, particularly the cache hierarchy
Example
Define a simple vector class in the beginning:

    class vector {
    private:
        int length;
        double* a;
    public:
        vector(int l);
        double component(int i) { return a[i]; }
        ...
    };
Example (cont'd)
Want to efficiently compute vector sums like

    c = a+b+d;

Efficiently implies avoiding
• the generation of temporary objects, and
• pumping the data through the memory hierarchy several times. This is actually the time-consuming part: moving data is more expensive than processing data!
Example (cont'd)
Need a wrapper class for all expressions:

    template<class A>
    class DExpr {    // double precision expression
    private:
        A wa;
    public:
        DExpr(const A& a) : wa(a) {}
        double component(int i) { return wa.component(i); }
    };
Example (cont'd)
Need an expression template class to represent sums of expressions:

    template<class A, class B>
    class DExprSum {
        A va;
        B vb;
    public:
        DExprSum(const A& a, const B& b) : va(a), vb(b) {}
        double component(int i) {
            return va.component(i) + vb.component(i);
        }
    };
Example (cont'd)
Need overloaded operator+() variants for all possible return types, for example:

    template<class A, class B>
    DExpr<DExprSum<DExpr<A>, DExpr<B> > >
    operator+(const DExpr<A>& a, const DExpr<B>& b) {
        typedef DExprSum<DExpr<A>, DExpr<B> > ExprT;
        return DExpr<ExprT>(ExprT(a,b));
    }
Example (cont'd)
The vector class must contain a member function operator=(const DExpr<A>& ea), where A is an expression template class.
Only when this member function is called does the actual computation (the vector sum) take place.
Part III
Optimization Techniques for Unstructured Grids
Optimizations for unstructured grids
• How unstructured is the grid?
• Sparse matrices and data flow analysis
• Grid processing
• Algorithm processing
• Examples
Is it really unstructured?
This is really a quasi-unstructured mesh: there is plenty of structure in most of the oceans. Coastal areas provide a real challenge.
Subgrids and patches
Motivating example
Suppose problem information for only half of the nodes fits in cache. Gauss-Seidel updates the nodes in order, which leads to poor use of cache:
• By the time node 37 is updated, information for node 1 has probably been evicted from cache.
• Each unknown must be brought into cache at each iteration.
Motivating example
Alternative:
• Divide into two connected subsets.
• Renumber.
• Update as much as possible within a subset before visiting the other.
This leads to better data reuse within cache:
• Some unknowns can be completely updated.
• Some partial residuals can be calculated.
Cache aware Gauss-Seidel
Preprocessing phase:
• Decompose each mesh into disjoint cache blocks.
• Renumber the nodes in each block.
• Find structures in the quasi-unstructured case.
• Produce the system matrices and intergrid transfer operators with the new ordering.
Gauss-Seidel phase:
• Update as much as possible within a block without referencing data from another block.
• Calculate a (partial) residual on the last update.
• Backtrack to finish updating cache block boundaries.
Preprocessing: mesh decomposition
Goals:
• Maximize the interior of each cache block.
• Minimize connections between cache blocks.
Constraint:
• The cache should be large enough to hold the part of the matrix, right-hand side, residual, and unknown associated with a cache block.
Critical parameter: usable cache size.
Such decomposition problems have been studied in depth for load balancing parallel computation.
Example of subblock membership
(Illustration: cache blocks and subblocks L1, L2, L3 identified on the subdomains Ω̂h1 and Ω̂h2, with the cache block boundary marked.)
Distance algorithms
Several constants:
• d: the degrees of freedom per vertex
• K: average number of connections per vertex
• N_Ω: number of vertices in the cache block
Three cases for complexity bounds:
• Cache boundaries connected
• Physical boundaries unknown
• Physical boundaries known
Standard Gauss-Seidel
The complexity bound C_gs in this notation is given by

    C_gs ≤ 2 d^2 N_Ω K + d.
Cache boundaries connected
(Algorithms 1 and 2: illustrations)
Cache boundaries connected
The complexities of Algorithms 1 and 2 are

    C_1 ≤ 5 N_Ω K    and    C_2 ≤ (7K+1) N_Ω.

The cost of Algorithms 1 and 2 with respect to Gauss-Seidel is at most 6 d^{-2} sweeps on the finest grid.
Physical boundaries unknown
(Algorithm 3: illustration)
Physical boundaries unknown
The complexity of Algorithm 3 is

    C_3 ≤ 9 N_Ω K.

The cost of Algorithms 1 and 3 with respect to Gauss-Seidel is at most 7 d^{-2} sweeps on the finest grid.
Physical boundaries known
(Algorithms 4 and 5: illustrations)
Physical boundaries known
The complexity of Algorithms 1, 4, and 5 is

    C_{1,4,5} ≤ (11K+1) N_Ω.

The cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 10 d^{-2} sweeps on the finest grid. This assumes that the number of physical boundary nodes is (N_Ω)^{1/2}, which is quite pessimistic.
Physical boundaries known
If we assume that there are no more than ½ N_Ω physical boundary nodes per cache block, then we get a better bound: the cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 5.5 d^{-2} sweeps on the finest grid.
Clearly, with better estimates of the realistic number of physical boundary nodes (i.e., much less than ½ N_Ω) per cache block, we can reduce this bound.
Preprocessing costs
Pessimistic bounds (never seen to date in a real problem):
• d = 1: 5.5 Gauss-Seidel sweeps
• d = 2: 1.375 Gauss-Seidel sweeps
• d = 3: 0.611 Gauss-Seidel sweeps
In-practice bounds:
• d = 1: ~1 Gauss-Seidel sweep
• d = 2: ~½ Gauss-Seidel sweep
Numerical results: Austria
Experiment: Austria - two-dimensional elasticity.
T = Cauchy stress tensor, w = displacement,
f = (1,-1)^T on Γ4,
f = (9.5-x, 4-y) if (x,y) is in the region surrounded by Γ5, and
f = 0 otherwise.

    -∇·T = f in Ω,
    ∂w/∂n = 100w on Γ1,
    ∂w/∂y = 100w on Γ2,
    ∂w/∂x = 100w on Γ3,
    ∂w/∂n = 0 everywhere else.
Numerical experiments: Bavaria
Experiment: Bavaria - coarse grid mesh.
Stationary heat equation with 7 sources and one sink (Munich Oktoberfest).
Homogeneous Dirichlet boundary conditions on the Czech border (northeast), homogeneous Neumann b.c.'s everywhere else.
Code commentary
Preprocessing steps are separate.
Computation:
• Standard algorithms implemented in as efficient a manner as known, without the cache aware or active set tricks. This is not Randy Bank's version, but much, much faster.
• The cache aware implementation uses quasi-unstructured information.
• The codes work equally well in 2D and 3D (the latter needs a larger cache than 2D) and are really general sparse matrix codes with a PDE front end.
Implementation details
One-parameter tuning: usable cache size, which is normally ~60% of the physical cache size.
Current code:
• Fortran + C (strict spaghetti coding style and fast).
• Requires help from the authors to add a new domain and coarse grid.
New code:
• C++.
• Should not require any author to be present.
• Is supposed to be available by the end of September and will probably be done earlier. Aiming for inclusion in ML (see software.sandia.gov).
Books
• W. Briggs, V.E. Henson, S.F. McCormick, A Multigrid Tutorial, SIAM, 2000.
• D. Bulka, D. Mayhew, Efficient C++, Addison-Wesley, 2000.
• C.C. Douglas, G. Haase, U. Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization, SIAM, 2003.
• S. Goedecker, A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.
• J. Handy, The Cache Memory Book, Academic Press, 1998.
• J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, 1996.
• U. Trottenberg, C. Oosterlee, A. Schüller, Multigrid, Academic Press, 2000.
Journal articles
• C.C. Douglas, Caching in with multigrid algorithms: problems in two dimensions, Parallel Algorithms and Applications, 9 (1996), pp. 195-204.
• C.C. Douglas, G. Haase, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Portable memory hierarchy techniques for PDE solvers, Part I, SIAM News 33/5 (2000), pp. 1, 8-9; Part II, SIAM News 33/6 (2000), pp. 1, 10-11, 16.
• C.C. Douglas, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Cache optimization for structured and unstructured grid multigrid, Electronic Transactions on Numerical Analysis, 10 (2000), pp. 25-40.
• C. Weiss, M. Kowarschik, U. Rüde, W. Karl, Cache-aware multigrid methods for solving Poisson's equation in two dimensions, Computing, 64 (2000), pp. 381-399.
Conference proceedings
• M. Kowarschik, C. Weiss, DiMEPACK - a cache-optimized multigrid library, Proc. of the Intl. Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), vol. I, June 2001, pp. 425-430.
• U. Rüde, Iterative algorithms on high performance architectures, Proc. of the EuroPar97 Conference, Lecture Notes in Computer Science, Springer, Berlin, 1997, pp. 57-71.
• U. Rüde, Technological trends and their impact on the future of supercomputing, in H.-J. Bungartz, F. Durst, C. Zenger (eds.), High Performance Scientific and Engineering Computing, Lecture Notes in Computer Science, Vol. 8, Springer, 1998, pp. 459-471.
Conference proceedings
• C. Weiss, W. Karl, M. Kowarschik, U. Rüde, Memory characteristics of iterative methods, Proc. of the Supercomputing Conference, Portland, Oregon, November 1999.
Related websites
http://www.mgnet.org
http://www.mgnet.org/~douglas/ccd-preprints.html
http://www.mgnet.org/~douglas/ml-dddas.html
http://www10.informatik.uni-erlangen.de/dime
http://valgrind.kde.org
http://www.fz-juelich.de/zam/PCL
http://icl.cs.utk.edu/projects/papi
http://www.hipersoft.rice.edu/hpctoolkit
http://www.tru64unix.compaq.com/dcpi
http://ec-securehost.com/SIAM/SE16.html