Cache Usage Tutorial - MGNet

Part I: Architectures and Fundamentals

Prof. Craig C. Douglas
University of Kentucky and Yale University
HiPC2003, 12/17/2003, vHPC Cache Aware Methods


Architectures and fundamentals
• Why worry about performance: an illustrative example
• Fundamentals of computer architecture
  o CPUs, pipelines, superscalarity
  o Memory hierarchy
• Basic efficiency guidelines
• Profiling


How fast should a solver be? (just a simple check with theory)
• The Poisson problem can be solved by a multigrid method in < 30 operations per unknown (known since the late 70's)
• More general elliptic equations may need O(100) operations per unknown
• A modern CPU can do 1-6 GFLOPS
• So we should be solving 10-60 million unknowns per second
• Should need O(100) Mbytes of memory


How fast are solvers today?
• Often no more than 10,000 to 100,000 unknowns are possible before the code breaks
• In a time of minutes to hours
• Needing horrendous amounts of memory
• Even state-of-the-art codes are often very inefficient


Comparison of solvers (what got me started in this business ~ '95)
[Plot: compute time in seconds (0.01 to 10, log scale) vs. problem size (1024, 4096, 16384 unknowns) for an unstructured code, SOR, structured-grid multigrid, and optimal multigrid]


Elements of CPU architecture
Modern CPUs are
• Superscalar: they can execute more than one operation per clock cycle, typically:
  o 4 integer operations per clock cycle, plus
  o 2 or 4 floating-point operations (multiply-add)
• Pipelined:
  o Floating-point ops take O(10) clock cycles to complete
  o A new set of ops can be started in each cycle
• Load-store: all operations are done on data in registers; all operands must be copied to/from memory via load and store operations
Code performance is heavily dependent on compiler (and manual) optimization


Pipelining
[Figure]


Pipelining (cont'd)
[Figure]


CPU trends
• EPIC (similar to VLIW) (IA64)
• Multi-threaded architectures (Alpha, Pentium 4 HT)
• Multiple CPUs on a single chip (IBM Power 4)
• Within the next decade:
  o Billion-transistor CPUs (today 200 million transistors)
  o Potential to build TFLOPS on a chip (e.g., SUN graphics processors)
  o But no way to move the data in and out sufficiently quickly!


Memory wall
• Latency: the time for memory to respond to a read (or write) request is too long
  o CPU ~ 0.5 ns (light travels 15 cm in a vacuum in that time)
  o Memory ~ 50 ns
• Bandwidth: the number of bytes which can be read (written) per second
  o A CPU with 1 GFLOPS peak performance needs 24 Gbyte/sec of bandwidth
  o Present CPUs have peak bandwidth …


Memory acceleration techniques
• Interleaving (independent memory banks store consecutive cells of the address space cyclically)
  o Improves bandwidth
  o But not latency
• Caches (small but fast memories) holding frequently used copies of main memory
  o Improve latency and bandwidth
  o Usually come in 2 or 3 levels nowadays
  o But only work when access to memory is local


Principles of locality
• Temporal locality: an item referenced now will be referenced again soon
• Spatial locality: an item referenced now indicates that its neighbors will be referenced soon
• Cache lines are typically 32-128 bytes, with 1024 bytes the longest recently. Lines, not words, are moved between memory levels, so both principles are exploited. There is an optimal line size based on the properties of the data bus and the memory subsystem design.


Caches
• Fast but small extra memory
• Holding identical copies of main memory
• Lower latency
• Higher bandwidth
• Usually several levels (2, 3, or 4)
• Same principle as virtual memory
• Memory requests are satisfied from
  o the fast cache (if it holds the appropriate copy): cache hit
  o the slow main memory (if the data is not in cache): cache miss


Typical cache configuration
[Figure]


Cache issues
• Uniqueness and transparency of the cache
• Finding the working set (what data is kept in cache)
• Data consistency with main memory
• Latency: time for memory to respond to a read (or write) request
• Bandwidth: number of bytes that can be read (written) per second


Cache issues (cont'd)
• Cache line size
  o Prefetching effect
  o False sharing (cf. associativity issues)
• Replacement strategy
  o Least Recently Used (LRU)
  o Least Frequently Used (LFU)
  o Random
• Translation lookaside buffer (TLB)
  o Stores virtual memory page translation entries
  o Has an effect similar to another level of cache
  o TLB misses are very expensive


Effect of cache hit ratio
Cache efficiency is characterized by the cache hit ratio h. The effective time for a data access is

  t_eff = h * t_cache + (1 - h) * t_mem

The speedup over uncached access is then given by

  speedup = t_mem / t_eff = 1 / (h * t_cache / t_mem + (1 - h))


Cache effectiveness depends on the hit ratio: hit ratios of 90% and better are needed for good speedups.


Cache organization
• Number of cache levels
• Set associativity
• Physical or virtual addressing
• Write-through/write-back policy
• Replacement strategy (e.g., random/LRU)
• Cache line size


Cache associativity
• Direct mapped (associativity = 1)
  o Each cache block can be stored in exactly one cache line of the cache memory
• Fully associative
  o A cache block can be stored in any cache line
• Set-associative (associativity = k)
  o Each cache block can be stored in one of k places in the cache
Direct mapped and set-associative caches give rise to conflict misses. Direct mapped caches are faster; fully associative caches are too expensive and too slow (if reasonably large). Set-associative caches are a compromise.


Typical architectures
• IBM Power 3:
  o L1 = 64 KB, 128-way set associative (funny definition, however)
  o L2 = 4 MB, direct mapped, line size = 128, write back
• IBM Power 4 (2 CPUs/chip):
  o L1 = 32 KB, 2-way, line size = 128
  o L2 = 1.5 MB, 8-way, line size = 128
  o L3 = 32 MB, 8-way, line size = 512
• Compaq EV6 (Alpha 21264):
  o L1 = 64 KB, 2-way associative, line size = 32
  o L2 = 4 MB (or larger), direct mapped, line size = 64
• HP PA-RISC:
  o PA8500, PA8600: L1 = 1.5 MB; PA8700: L1 = 2.25 MB
  o no L2 cache!


Typical architectures (cont'd)
• AMD Athlon (from "Thunderbird" on):
  o L1 = 64 KB, L2 = 256 KB
• Intel Pentium 4:
  o L1 = 8 KB, 4-way, line size = 64
  o L2 = 256 KB up to 2 MB, 8-way, line size = 128
• Intel Itanium:
  o L1 = 16 KB, 4-way
  o L2 = 96 KB, 6-way
  o L3: off-chip, size varies
• Intel Itanium 2 (McKinley / Madison):
  o L1 = 16 / 32 KB
  o L2 = 256 / 256 KB
  o L3 = 1.5 or 3 / 6 MB


Basic efficiency guidelines
• Choose the best algorithm
• Use efficient libraries
• Find good compiler options
• Use suitable data layouts


Choose the best algorithm
Example: solution of linear systems arising from the discretization of a special PDE
• Gaussian elimination (standard): n^3/3 ops
• Banded Gaussian elimination: 2n^2 ops
• SOR method: 10n^1.5 ops
• Multigrid method: 30n ops


Choose the best algorithm (cont'd)
• For large n, the multigrid method will always outperform the others, even if it is badly implemented
• Frequently, however, two methods have approximately the same complexity, and then the better implemented one will win


Use efficient libraries
• Good libraries often outperform your own software
• Clever, sophisticated algorithms
• Optimized for the target machine
• Machine-specific implementations


Sources for libraries
• Vendor-independent
  o Commercial: NAG, IMSL, etc.; only available as binaries, often optimized for a specific platform
  o Free codes: e.g., NETLIB (LAPACK, ODEPACK, …), usually as source code, not specifically optimized
• Vendor-specific: e.g., cxml for HP Alpha with highly tuned LAPACK routines


Sources for libraries (cont'd)
• Many libraries are quasi-standards
  o BLAS
  o LAPACK
  o etc.
• Parallel libraries for supercomputers
• Specialists can sometimes outperform vendor-specific libraries


Find good compiler options
Modern compilers have numerous flags to select individual optimization options
• -On: successively more aggressive optimizations, n = 1,...,8
• -fast: may change round-off behavior
• -unroll
• -arch
• etc.
Learning about your compiler is usually worth it: RTFM (which may be hundreds of pages long).


Find good compiler options (cont'd)
Hints:
• Read man cc (man f77) or cc -help (or whatever causes the possible options to print)
• Look up the compiler options documented at www.specbench.org for specific platforms
• Experiment and compare performance on your own codes


Use suitable data layout
Access memory in order! In C/C++, for a 2D matrix

  double a[n][m];

the loops should be nested such that

  for (i...)
    for (j...)
      a[i][j]...

In FORTRAN, it must be the other way round. Apply loop interchange if necessary (see below).


Use suitable data layout (cont'd)
Another example: array merging. Three vectors accessed together (in C/C++):

  double a[n], b[n], c[n];

can often be handled more efficiently by using

  double abc[n][3];

In FORTRAN, the indices are again permuted.


Profiling
Subroutine-level profiling
• The compiler inserts timing calls at the beginning and end of each subroutine
• Only suitable for coarse code analysis
• Profiling overhead can be significant
• E.g., prof, gprof


Profiling (cont'd)
Tick-based profiling
• The OS interrupts code execution regularly
• The profiling tool monitors the code locations
• More detailed code analysis is possible
• Profiling overhead can still be significant
Profiling using hardware performance monitors
• Most popular approach
• Will therefore be discussed next in more detail


Profiling: hardware performance counters
Dedicated CPU registers are used to count various events at runtime:
• Data cache misses (for different levels)
• Instruction cache misses
• TLB misses
• Branch mispredictions
• Floating-point and/or integer operations
• Load/store instructions
• Etc.


Profiling tools: DCPI
• DCPI = Digital Continuous Profiling Infrastructure (still supported by the current owner, HP; the source is even available)
• Only for Alpha-based machines running Tru64 UNIX
• Code execution is watched by a profiling daemon
• Can only be used from outside the code
• http://www.tru64unix.compaq.com/dcpi


Profiling tools: valgrind
• Memory/thread debugger and cache profiler (4 tools). Part of the KDE project: free.
• Run using: cachegrind program
• Not an intrusive library; uses hardware capabilities of CPUs.
• Simple to use (even for automatic testing).
• Julian Seward et al., http://valgrind.kde.org


Profiling tools: PCL
• PCL = Performance Counter Library
• R. Berrendorf et al., FZ Juelich, Germany
• Available for many platforms (portability!)
• Usable from outside and from inside the code (library calls; C, C++, Fortran, and Java interfaces)
• http://www.fz-juelich.de/zam/PCL


Profiling tools: PAPI
• PAPI = Performance API
• Available for many platforms (portability!)
• Two interfaces:
  o High-level interface for simple measurements
  o Fully programmable low-level interface, based on thread-safe groups of hardware events (EventSets)
• http://icl.cs.utk.edu/projects/papi


Profiling tools: HPCToolkit
• High-level portable tools for performance measurements and comparisons
  o Uses a browser interface
  o PAPI should have looked like this
  o Make a change, check what happens on several architectures at once
• http://www.hipersoft.rice.edu/hpctoolkit
• Rob Fowler et al., Rice University, USA


hpcview screen shot
[Figure]


HPCToolkit Philosophy 1
Intuitive, top-down user interface for performance analysis
• Machine-independent tools and GUI
  o Statistics-to-XML converters
• Language independence
  o Needs a good symbol locator at run time
• Eliminate invasive instrumentation
• Cross-platform comparisons


HPCToolkit Philosophy 2
Provide the information needed for analysis and tuning
• Multilanguage applications
• Multiple metrics
  o Must compare metrics which are causes versus effects (examples: misses, flops, loads, mispredicts, cycles, stall cycles, etc.)
• Hide the details from the user as much as possible


HPCToolkit Philosophy 3
Eliminate manual labor from the analyze, tune, run cycle
• Collect multiple data automatically
• Eliminate the 90-10 rule
  o 90% of cycles in 10% of the code… for a 500K-line code, the hotspot is still 5,000 lines of code. How do you deal with a 5K-line hotspot?
• Drive the process with simple scripts


Our reference code
• 2D structured multigrid code written in C
• Double precision floating-point arithmetic
• 5-point stencils
• Red/black Gauss-Seidel smoother
• Full weighting, linear interpolation
• Direct solver on the coarsest grid (LU, LAPACK)


Structured grid
[Figure]


Using PCL - Example 1
Digital PWS 500au
• Alpha 21164, 500 MHz, 1000 MFLOPS peak
• 3 on-chip performance counters
PCL hardware performance monitor: hpm

  % hpm --events PCL_CYCLES,PCL_MFLOPS ./mg
  hpm: elapsed time: 5.172 s
  hpm: counter 0 : 2564941490 PCL_CYCLES
  hpm: counter 1 : 19.635955 PCL_MFLOPS


Using PCL - Example 2

  #include <pcl.h>

  int main(int argc, char **argv) {
    // Initialization
    PCL_CNT_TYPE i_result[2];
    PCL_FP_CNT_TYPE fp_result[2];
    int counter_list[]= {PCL_FP_INSTR, PCL_MFLOPS}, res;
    unsigned int flags= PCL_MODE_USER;
    PCL_DESCR_TYPE descr;


Using PCL - Example 2 (cont'd)

    PCLinit(&descr);
    if (PCLquery(descr, counter_list, 2, flags) != PCL_SUCCESS)
      // Issue error message ...
    else {
      PCLstart(descr, counter_list, 2, flags);
      // Do computational work here ...
      PCLstop(descr, i_result, fp_result, 2);
      printf("%i fp instructions, MFLOPS: %f\n",
             i_result[0], fp_result[1]);
    }
    PCLexit(descr);
    return 0;
  }


Using DCPI
Alpha-based machines running Tru64 UNIX
How to proceed when using DCPI:
1. Start the DCPI daemon (dcpid)
2. Run your code
3. Stop the DCPI daemon
4. Use the DCPI tools to analyze the profiling data


Examples of DCPI tools
• dcpiwhatcg: where have all the cycles gone?
• dcpiprof: breakdown of CPU time by procedures
• dcpilist: code listing (source/assembler) annotated with profiling data
• dcpitopstalls: ranking of instructions causing stall cycles


Using DCPI - Example 1

  % dcpiprof ./mg
  Column  Total  Period (for events)
  ------  -----  ------
  dmiss   45745  4096
  ===================================================
  dmiss      %     cum%  procedure      image
  33320  72.84%  72.84%  mgSmooth       ./mg
  10008  21.88%  94.72%  mgRestriction  ./mg
   2411   5.27%  99.99%  mgProlongCorr  ./mg
  [...]


Using DCPI - Example 2
Call the DCPI analysis tool:

  % dcpiwhatcg ./mg

Dynamic stalls are listed first:

  I-cache (not ITB)    0.1% to  7.4%
  ITB/I-cache miss     0.0% to  0.0%
  D-cache miss        24.2% to 27.6%
  DTB miss            53.3% to 57.7%
  Write buffer         0.0% to  0.3%
  Synchronization      0.0% to  0.0%


Using DCPI - Example 2 (cont'd)

  Branch mispredict    0.0% to  0.0%
  IMUL busy            0.0% to  0.0%
  FDIV busy            0.0% to  0.5%
  Other                0.0% to  0.0%
  Unexplained stall    0.4% to  0.4%
  Unexplained gain    -0.7% to -0.7%
  ---------------------------------------
  Subtotal dynamic    85.1%


Using DCPI - Example 2 (cont'd)
Static stalls are listed next:

  Slotting             0.5%
  Ra dependency        3.0%
  Rb dependency        1.6%
  Rc dependency        0.0%
  FU dependency        0.5%
  -----------------------------------------------
  Subtotal static      5.6%
  -----------------------------------------------
  Total stall         90.7%


Using DCPI - Example 2 (cont'd)
Useful cycles are listed at the end:

  Useful               7.9%
  Nops                 1.3%
  -----------------------------------------------
  Total execution      9.3%

Compare to the total percentage of stall cycles: 90.7% (cf. previous slide).


Part II: Optimization Techniques for Structured Grids


How to make codes fast
1. Use a fast algorithm (e.g., multigrid)
   o It does not make sense to optimize a bad algorithm
   o However, sometimes a fairly simple algorithm that is well implemented will beat a very sophisticated, super method that is poorly programmed
2. Use good coding practices
3. Use good data structures
4. Apply appropriate optimization techniques


Optimization of Floating-Point Operations


Optimization of FP operations
• Loop unrolling
• Fused multiply-add (FMA) instructions
• Exposing instruction-level parallelism (ILP)
• Software pipelining (again: exploit ILP)
• Aliasing
• Special functions
• Eliminating overheads
  o if statements
  o Loop overhead
  o Subroutine calling overhead


Loop unrolling
• Simplest effect of loop unrolling: fewer test/jump instructions (fatter loop body, less loop overhead)
• Fewer loads per flop
• May lead to threaded code that uses multiple FP units concurrently (instruction-level parallelism)
• How are loops handled whose trip count is not a multiple of the unrolling factor?
• Very long loops may not benefit from unrolling (instruction cache capacity!)
• Very short loops may suffer from unrolling, or benefit strongly


Loop unrolling: making fatter loop bodies
Example: DAXPY operation

      do i= 1,N
         a(i)= a(i)+b(i)*c
      enddo

Unrolled four times (assuming N is a multiple of 4):

      do i= 1,N,4
         a(i)= a(i)+b(i)*c
         a(i+1)= a(i+1)+b(i+1)*c
         a(i+2)= a(i+2)+b(i+2)*c
         a(i+3)= a(i+3)+b(i+3)*c
      enddo

With a preconditioning loop that handles the cases when N is not a multiple of 4:

      ii= imod(N,4)
      do i= 1,ii
         a(i)= a(i)+b(i)*c
      enddo
      do i= 1+ii,N,4
         a(i)= a(i)+b(i)*c
         a(i+1)= a(i+1)+b(i+1)*c
         a(i+2)= a(i+2)+b(i+2)*c
         a(i+3)= a(i+3)+b(i+3)*c
      enddo


Loop unrolling: improving the flop/load ratio
Analysis of the flop-to-load ratio often unveils another benefit of unrolling:

      do i= 1,N
         do j= 1,M
            y(i)=y(i)+a(j,i)*x(j)
         enddo
      enddo

In the innermost loop, three loads and two flops are performed; i.e., we have roughly one load per flop.


Loop unrolling: improving the flop/load ratio (cont'd)
Both loops unrolled twice:

      do i= 1,N,2
         t1= 0
         t2= 0
         do j= 1,M,2
            t1= t1+a(j,i)  *x(j) +a(j+1,i)  *x(j+1)
            t2= t2+a(j,i+1)*x(j) +a(j+1,i+1)*x(j+1)
         enddo
         y(i)  = t1
         y(i+1)= t2
      enddo

The innermost loop now does 8 loads and 8 flops, and exposes instruction-level parallelism.
How about unrolling by 4? Watch out for register spill!


Fused multiply-add (FMA)
On many CPUs (e.g., IBM Power3/Power4) there is an instruction which multiplies two operands and adds the result to a third.
Consider the code

  a= b + c*d + f*g

versus

  a= c*d + f*g + b

Can the reordering be done automatically?


Exposing ILP

      program nrm1
      real a(n)
      tt= 0d0
      do j= 1,n
         tt= tt + a(j) * a(j)
      enddo
      print *,tt
      end

      program nrm2
      real a(n)
      tt1= 0d0
      tt2= 0d0
      do j= 1,n,2
         tt1= tt1 + a(j)*a(j)
         tt2= tt2 + a(j+1)*a(j+1)
      enddo
      tt= tt1 + tt2
      print *, tt
      end


Exposing ILP (cont'd)
• Superscalar CPUs have a high degree of on-chip parallelism that should be exploited
• The optimized code uses temporary variables to indicate independent instruction streams
• This is more than just loop unrolling!
• Can this be done automatically?
• Does it change the rounding errors?


Software pipelining
• Arranging instructions in groups that can be executed together in one cycle
• Again, the idea is to exploit instruction-level parallelism (on-chip parallelism)
• Often done by optimizing compilers, but not always successfully
• Closely related to loop unrolling
• Less important on out-of-order CPUs


Aliasing
• Arrays (or other data) that refer to the same memory locations
• Aliasing rules differ between programming languages; e.g.,
  o FORTRAN forbids aliasing: the result is undefined
  o C/C++ permit aliasing
• This is one reason why FORTRAN compilers often produce faster code than C/C++ compilers do


Aliasing (cont'd)
Example:

      subroutine sub(n, a, b, c, sum)
      double precision sum, a(n), b(n), c(n)
      sum= 0d0
      do i= 1,n
         a(i)= b(i) + 2.0d0*c(i)
      enddo
      return
      end

FORTRAN rule: two variables must not be aliased when one or both of them are modified in the subroutine.
Correct call:   call sub(n,a,b,c,sum)
Incorrect call: call sub(n,a,a,c,sum)


Aliasing (cont'd)
• Aliasing is legal in C/C++: the compiler must produce conservative code
• More complicated aliasing is possible; e.g., a(i) with a(i+2)
• C/C++: the restrict keyword, or a compiler option such as -noalias


Special functions
• / (divide), sqrt, exp, log, sin, cos, …, etc. are expensive (up to several dozen cycles)
• Use math identities, e.g., log(x) + log(y) = log(x*y)
• Use special libraries that
  o vectorize when many of the same functions must be evaluated
  o trade accuracy for speed, when appropriate


Eliminating overheads: if statements
if statements …
• prohibit some optimizations (e.g., loop unrolling in some cases)
• take time to evaluate the condition expression
• may interrupt the CPU pipeline (dynamic jump prediction)
Goal: avoid if statements in the innermost loops.
No generally applicable technique exists.


Eliminating if statements: an example

      subroutine thresh0(n,a,thresh,ic)
      dimension a(n)
      ic= 0
      tt= 0.d0
      do j= 1,n
         tt= tt + a(j) * a(j)
         if (sqrt(tt).ge.thresh) then
            ic= j
            return
         endif
      enddo
      return
      end

• Avoid the sqrt in the condition! (square thresh instead)
• Add tt in blocks of 128, for example (without the condition), and repeat the last block when the condition is violated.


Eliminating loop overheads
For starting a loop, the CPU must free certain registers: loop counter, addresses, etc. This may be significant for a short loop!
Example: for n > m,

      do i= 1,n
         do j= 1,m
         ...

is less efficient than

      do j= 1,m
         do i= 1,n
         ...

However, data access optimizations are even more important; see below.


Eliminating subroutine calling overhead
• Subroutines (functions) are very important for structured, modular programming
• Subroutine calls are expensive (on the order of up to 100 cycles)
• Passing value arguments (copying data) can be extremely expensive when used inappropriately
• Passing reference arguments (as in FORTRAN) may be dangerous from the point of view of correct software
• Reference arguments (as in C++) can be combined with a const declaration
• Generally, no subroutine calls should be used in tight loops


Eliminating subroutine calling overhead (cont'd)
Inlining: the inline declaration in C++ (see below), or done automatically by the compiler.
Macros in C or any other language:

  #define sqre(a) a*a

What can go wrong:

  sqre(x+y)  expands to  x+y*x+y
  sqre(f(x)) expands to  f(x) * f(x)

• What if f has side effects?
• What if f has no side effects, but the compiler cannot deduce that?


Memory Hierarchy Optimizations: Data Layout


Data layout optimizations
• Array transpose to get stride-1 access
• Building cache-aware data structures by array merging
• Array padding
• Etc.


Data layout optimizations
Stride-1 access is usually fastest, for several reasons; particularly the reuse of cache line contents.
Data layout for multidimensional arrays in FORTRAN: column-major order.
Example: a 4x3 array A(i,j) is stored in memory in the order

  A(1,1), A(2,1), A(3,1), A(4,1), A(1,2), A(2,2), ..., A(4,3)

The data arrangement is the "transpose" of the usual matrix layout.


Data layout optimizations (cont'd)
Stride-1 access: the innermost loop iterates over the first index (in FORTRAN). Achieve it
• either by choosing the right data layout (array transpose), or
• by arranging nested loops in the right order (loop interchange).

Stride-N access:

      do i=1,N
         do j=1,M
            a(i,j)=a(i,j)+b(i,j)
         enddo
      enddo

Stride-1 access:

      do j=1,M
         do i=1,N
            a(i,j)=a(i,j)+b(i,j)
         enddo
      enddo

This will usually be done by the compiler!


Data layout optimizations: stride-1 access

      do i=1,N
         do j=1,M
            s(i)=s(i)+b(i,j)*c(j)
         enddo
      enddo

How about loop interchange in this case? Better: transpose the matrix b so that the inner loop gets stride-1 access.


Data layout optimizations: Cache-aware data structures

Idea: merge data which are needed together to increase spatial locality: cache lines contain several data items.
Example: Gauss-Seidel iteration; determine the data items needed simultaneously:

  $u_i^{(k+1)} = a_{i,i}^{-1}\left(f_i - \sum_{j<i} a_{i,j}\, u_j^{(k+1)} - \sum_{j>i} a_{i,j}\, u_j^{(k)}\right)$


Data layout optimizations: Cache-aware data structures

Example (cont'd): the right-hand side and the coefficients are accessed simultaneously, so reuse cache line contents by array merging to enhance spatial locality:

  typedef struct {
    double f;
    double c_N, c_E, c_S, c_W, c_C;
  } equationData;                  // Data merged in memory

  double u[N][N];                  // Solution vector
  equationData rhsAndCoeff[N][N];  // Right-hand side and coefficients


Data layout optimizations: Array padding

Idea: allocate arrays larger than necessary to
• change relative memory distances
• avoid severe cache thrashing effects

Example (FORTRAN: column-major order): replace

  double precision u(1024, 1024)

by

  double precision u(1024+pad, 1024)

How to choose pad?


Data layout optimizations: Array padding

C.-W. Tseng et al. (UMD): research on cache modeling and compiler-based array padding:
• Intra-variable padding: pad within arrays ⇒ avoid self-interference misses
• Inter-variable padding: pad between different arrays ⇒ avoid cross-interference misses


Data layout optimizations: Array padding

Padding in 2D; e.g., FORTRAN77:

  double precision u(0:1024+pad,0:1024)


Memory Hierarchy Optimizations: Data Access


Loop optimizations
• Loop unrolling (see above)
• Loop interchange
• Loop fusion
• Loop split = loop fission = loop distribution
• Loop skewing
• Loop blocking
• Etc.


Data access optimizations: Loop fusion
• Idea: transform successive loops into a single loop to enhance temporal locality
• Reduces cache misses and enhances cache reuse (exploits temporal locality)
• Often applicable when data sets are processed repeatedly (e.g., in the case of iterative methods)


Data access optimizations: Loop fusion

Before:
  do i= 1,N
    a(i)= a(i)+b(i)
  enddo
  do i= 1,N
    a(i)= a(i)*c(i)
  enddo
• a is loaded into the cache twice (if sufficiently large)

After:
  do i= 1,N
    a(i)= (a(i)+b(i))*c(i)
  enddo
• a is loaded into the cache only once


Data access optimizations: Loop fusion

Example: red/black Gauss-Seidel iteration in 2D


Data access optimizations: Loop fusion

Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics: column-major order):

  for it= 1 to numIter do
    // Red nodes
    for i= 1 to n-1 do
      for j= 1+(i+1)%2 to n-1 by 2 do
        relax(u(j,i))
      end for
    end for


Data access optimizations: Loop fusion

    // Black nodes
    for i= 1 to n-1 do
      for j= 1+i%2 to n-1 by 2 do
        relax(u(j,i))
      end for
    end for
  end for

This requires two sweeps through the whole data set per single GS iteration!


Data access optimizations: Loop fusion

How the fusion technique works:


Data access optimizations: Loop fusion

Code after applying the loop fusion technique:

  for it= 1 to numIter do
    // Update red nodes in first grid row
    for j= 1 to n-1 by 2 do
      relax(u(j,1))
    end for


Data access optimizations: Loop fusion

    // Update red and black nodes in pairs
    for i= 1 to n-1 do
      for j= 1+(i+1)%2 to n-1 by 2 do
        relax(u(j,i))
        relax(u(j,i-1))
      end for
    end for


Data access optimizations: Loop fusion

    // Update black nodes in last grid row
    for j= 2 to n-1 by 2 do
      relax(u(j,n-1))
    end for
  end for

The solution vector u passes through the cache only once instead of twice per GS iteration!


Data access optimizations: Loop split

The inverse transformation of loop fusion: divide the work of one loop into two to make the loop body less complicated.
• Leverage compiler optimizations
• Enhance instruction cache utilization


Data access optimizations: Loop blocking

Loop blocking = loop tiling
• Divide the data set into subsets (blocks) which are small enough to fit in cache
• Perform as much work as possible on the data in cache before moving to the next block
• This is not always easy to accomplish because of data dependencies


Data access optimizations: Loop blocking

Example: 1D blocking for red/black GS; respect the data dependencies!


Data access optimizations: Loop blocking

Code after applying the 1D blocking technique; B = number of GS iterations to be blocked/combined:

  for it= 1 to numIter/B do
    // Special handling: rows 1, ..., 2B-1
    // Not shown here ...


Data access optimizations: Loop blocking

    // Inner part of the 2D grid
    for k= 2*B to n-1 do
      for i= k to k-2*B+1 by -2 do
        for j= 1+(k+1)%2 to n-1 by 2 do
          relax(u(j,i))
          relax(u(j,i-1))
        end for
      end for
    end for


Data access optimizations: Loop blocking

    // Special handling: rows n-2B+1, ..., n-1
    // Not shown here ...
  end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, provided 2*B+2 grid rows fit in the cache simultaneously. If the grid rows are too large, 2D blocking can be applied.


Data access optimizations: Loop blocking

More complicated blocking schemes exist.
Illustration: 2D square blocking


Data access optimizations: Loop blocking

Illustration: 2D skewed blocking


Two common multigrid algorithms

V-cycle to solve A_4 u_4 = f_4:
  Smooth A_4 u_4 = f_4. Set f_3 = R_3 r_4.
  Smooth A_3 u_3 = f_3. Set f_2 = R_2 r_3.
  Smooth A_2 u_2 = f_2. Set f_1 = R_1 r_2.
  Solve A_1 u_1 = f_1 directly.
  Set u_2 = u_2 + I_1 u_1. Smooth A_2 u_2 = f_2.
  Set u_3 = u_3 + I_2 u_2. Smooth A_3 u_3 = f_3.
  Set u_4 = u_4 + I_3 u_3. Smooth A_4 u_4 = f_4.

W cycle


Cache-optimized multigrid: the DiMEPACK library

DFG project DiME: data-local iterative methods
• Fast algorithm + fast implementation
• Correction scheme: V-cycles, FMG
• Rectangular domains
• Constant 5-/9-point stencils
• Dirichlet/Neumann boundary conditions


DiMEPACK library
• C++ interface, fast Fortran77 subroutines
• Direct solution of the problems on the coarsest grid (LAPACK: LU, Cholesky)
• Single/double precision floating-point arithmetic
• Various array padding heuristics (Tseng)

http://www10.informatik.uni-erlangen.de/dime


V(2,2) cycle - bottom line

  Mflops   For what
  13       Standard 5-pt. operator
  56       Cache optimized (loop orderings, data merging, simple blocking)
  150      Constant coeff. + skewed blocking + padding
  220      Eliminating rhs if 0 everywhere but boundary


Example: Cache-Optimized Multigrid on Regular Grids in 3D


Data layout optimizations for 3D multigrid

Array padding


Data layout optimizations for 3D multigrid

Standard padding in 3D; e.g., FORTRAN77:

  double precision u(0:1024,0:1024,0:1024)

becomes:

  double precision u(0:1024+pad1,0:1024+pad2,0:1024)


Data layout optimizations for 3D multigrid

Non-standard padding in 3D:

  double precision u(0:1024+pad1,0:1024,0:1024)
  ...
  u(i+k*pad2, j, k)

(or use hand-made index linearization - performance effect?)


Data layout optimizations for 3D multigrid

Array merging


Data access optimizations for 3D multigrid

1-way blocking with loop interchange


Data access optimizations for 3D multigrid

2-way blocking and 3-way blocking


Data access optimizations for 3D multigrid

4-way blocking


Example: Cache Optimizations for the Lattice Boltzmann Method


Lattice Boltzmann method
• Mainly used in CFD applications
• Employs a regular grid structure (2D, 3D)
• Particle-oriented approach based on a microscopic model of the moving fluid particles
• Jacobi-like cell update pattern: a single time step of the LBM consists of
  • a stream step and
  • a collide step


LBM (cont'd)

Stream: read distribution functions from neighbors.
Collide: re-compute own distribution functions.


Data layout optimization

Layout 1: two separate grids (standard approach)


Layout 2: grid compression: save memory, enhance locality


Data access optimizations

Access pattern 1: 3-way blocking


3-way blocking (cont'd)


Access pattern 2: 4-way blocking


4-way blocking (cont'd)


Illustration of the combination of layout + access optimizations

Layout: separate grids, access pattern: 3-way blocking
Layout: separate grids, access pattern: 4-way blocking


Layout: grid compression, access pattern: 3-way blocking


Layout: grid compression, access pattern: 4-way blocking


Performance results

MFLOPS for 2D GS, constant coefficients, 5-pt. stencil; DEC PWS 500au, Alpha 21164 CPU, 500 MHz


Memory access behavior
• Digital PWS 500au, Alpha 21164 CPU; L1 = 8 KB, L2 = 96 KB, L3 = 4 MB
• We use DCPI to obtain the performance data
• We measure the percentage of accesses which are satisfied by each individual level of the memory hierarchy
• Comparison: standard implementation of red/black GS (efficient loop ordering) vs. 2D skewed blocking (with and without padding)


Memory access behavior

Standard implementation of red/black GS, without array padding (% of accesses):

  Size   +/-    L1     L2     L3     Mem.
  33     4.5    63.6   32.0   0.0    0.0
  65     0.5    75.7   23.6   0.2    0.0
  129    -0.2   76.1   9.3    14.8   0.0
  257    5.3    55.1   25.0   14.5   0.0
  513    3.9    37.7   45.2   12.4   0.8
  1025   5.1    27.8   50.0   9.9    7.2
  2049   4.5    30.3   45.0   13.0   7.2


Memory access behavior

2D skewed blocking without array padding, 4 iterations blocked (B = 4) (% of accesses):

  Size   +/-    L1     L2     L3     Mem.
  33     27.4   43.4   29.1   0.1    0.0
  65     33.4   46.3   19.5   0.9    0.0
  129    36.9   42.3   19.1   1.7    0.0
  257    38.1   34.1   25.1   2.7    0.0
  513    38.0   28.3   27.0   6.7    0.1
  1025   36.9   24.9   19.7   17.6   0.9
  2049   36.2   25.5   0.4    36.9   0.9


Memory access behavior

2D skewed blocking with appropriate array padding, 4 iterations blocked (B = 4) (% of accesses):

  Size   +/-    L1     L2     L3     Mem.
  33     28.2   66.4   5.3    0.0    0.0
  65     34.3   55.7   9.1    0.9    0.0
  129    37.5   51.7   9.0    1.9    0.0
  257    37.8   52.8   7.0    2.3    0.0
  513    38.4   52.7   6.2    2.4    0.3
  1025   36.7   54.3   6.1    2.0    0.9
  2049   35.9   55.2   6.0    1.9    0.9


Performance results (cont'd)

3D MG, F77, variable coefficients, 7-pt. stencil; Intel Pentium 4, 2.4 GHz, Intel ifc V7.0 compiler


Performance results (cont'd)

2D LBM (D2Q9), C(++); AMD Athlon XP 2400+, 2.0 GHz, Linux, gcc V3.2.1 compiler


Performance results (cont'd)

Cache behavior (left: L1, right: L2) for the previous experiment, measured with PAPI


Performance results (cont'd)

3D LBM (D3Q19), C; AMD Opteron, 1.6 GHz, Linux, gcc V3.2.2 compiler


C++-Specific Considerations


C++-specific considerations

We will (briefly) address the following issues:
• Inlining
• Virtual functions
• Expression templates


Inlining

Macro-like code expansion: replace the function call by the body of the function to be inlined.
How to accomplish inlining:
• Use the C++ keyword inline, or
• Define the method within the class declaration
In any case: the method to be inlined needs to be defined in the header file.
However: inlining is just a suggestion to the compiler!


Inlining (cont'd)

Advantages:
• Reduce function call overhead (see above)
• Leverage cross-call optimizations: optimize the code after expanding the function body

Disadvantage:
• The size of the machine code increases (instruction cache capacity!)


Virtual functions
• Member functions may be declared to be virtual (C++ keyword virtual)
• This mechanism becomes relevant when base class pointers are used to point to instances of derived classes
• The actual member function to be called can often be determined only at runtime (polymorphism)
• Requires a virtual function table lookup (at runtime!)
• Can be very time-consuming!


Inlining virtual functions

Virtual functions are often not compatible with inlining, where function calls are replaced by function bodies at compile time.
If the type of the object can be deduced at compile time, the compiler can even inline virtual functions (at least theoretically ...).


Expression templates
• C++ technique for passing expressions as function arguments
• The expression can be inlined into the function body using (nested) C++ templates
• Avoids the use of temporaries, and therefore multiple passes of the data through the memory subsystem, particularly the cache hierarchy


Example

Define a simple vector class in the beginning:

  class vector {
  private:
    int length;
    double* a;
  public:
    vector(int l);
    double component(int i) { return a[i]; }
    ...
  };


Example (cont'd)

We want to efficiently compute vector sums like

  c= a+b+d;

Efficiently implies:
• avoiding the generation of temporary objects
• avoiding pumping the data through the memory hierarchy several times; this is actually the time-consuming part: moving data is more expensive than processing data!


Example (cont'd)

We need a wrapper class for all expressions:

  template<class A>
  class DExpr {      // double precision expression
  private:
    A wa;
  public:
    DExpr(const A& a) : wa(a) {}
    double component(int i) { return wa.component(i); }
  };


Example (cont'd)

We need an expression template class to represent sums of expressions:

  template<class A, class B>
  class DExprSum {
    A va;
    B vb;
  public:
    DExprSum(const A& a, const B& b) : va(a), vb(b) {}
    double component(int i) {
      return va.component(i) + vb.component(i);
    }
  };


Example (cont'd)

We need overloaded operator+() variants for all possible argument types, for example:

  template<class A, class B>
  DExpr<DExprSum<DExpr<A>, DExpr<B> > >
  operator+(const DExpr<A>& a, const DExpr<B>& b) {
    typedef DExprSum<DExpr<A>, DExpr<B> > ExprT;
    return DExpr<ExprT>(ExprT(a,b));
  }


Example (cont'd)

The vector class must contain a member function operator=(const A& ea), where A is an expression template class.
Only when this member function is called does the actual computation (the vector sum) take place.


Part III
Optimization Techniques for Unstructured Grids


Optimizations for unstructured grids
• How unstructured is the grid?
• Sparse matrices and data flow analysis
• Grid processing
• Algorithm processing
• Examples


Is it really unstructured?

This is really a quasi-unstructured mesh: there is plenty of structure in most of the oceans. Coastal areas provide a real challenge.


Subgrids and patches


Motivating example
• Suppose problem information for only half of the nodes fits in cache.
• Gauss-Seidel updates nodes in order.
• This leads to poor use of cache:
  • By the time node 37 is updated, information for node 1 has probably been evicted from cache.
  • Each unknown must be brought into cache at each iteration.


Motivating example
• Alternative:
  • Divide into two connected subsets.
  • Renumber.
  • Update as much as possible within a subset before visiting the other.
• This leads to better data reuse within cache:
  • Some unknowns can be completely updated.
  • Some partial residuals can be calculated.


Cache aware Gauss-Seidel

Preprocessing phase
• Decompose each mesh into disjoint cache blocks.
• Renumber the nodes in each block.
• Find structures in the quasi-unstructured case.
• Produce the system matrices and intergrid transfer operators with the new ordering.

Gauss-Seidel phase
• Update as much as possible within a block without referencing data from another block.
• Calculate a (partial) residual on the last update.
• Backtrack to finish updating cache block boundaries.


Preprocessing: Mesh decomposition

Goals
• Maximize the interior of each cache block.
• Minimize connections between cache blocks.

Constraint
• The cache should be large enough to hold the parts of the matrix, right-hand side, residual, and unknowns associated with a cache block.

Critical parameter: usable cache size.
Such decomposition problems have been studied in depth for load balancing parallel computations.


Example of subblock membership

Cache blocks identified; subblocks identified.
[Figure: subblocks L1, L2, L3 along a cache block boundary between the subdomains Ω_h^1 and Ω_h^2]


Distance algorithms

Several constants
• d: the degrees of freedom per vertex.
• K: average number of connections per vertex.
• N_Ω: number of vertices in cache block Ω.

Three cases for complexity bounds
• Cache boundaries connected.
• Physical boundaries unknown.
• Physical boundaries known.


Standard Gauss-Seidel

The complexity bound C_gs in this notation is given by

  C_gs ≤ 2d^2 N_Ω K + d.


Cache boundaries connected


Cache boundaries connected


Cache boundaries connected

The complexities of Algorithms 1 and 2 are

  C_1 ≤ 5 N_Ω K   and   C_2 ≤ (7K+1) N_Ω.

The cost of Algorithms 1 and 2 with respect to Gauss-Seidel is at most 6d^-2 sweeps on the finest grid.


Physical boundaries unknown


Physical boundaries unknown

The complexity of Algorithm 3 is

  C_3 ≤ 9 N_Ω K.

The cost of Algorithms 1 and 3 with respect to Gauss-Seidel is at most 7d^-2 sweeps on the finest grid.


Physical boundaries known


Physical boundaries known


Physical boundaries known

The complexity of Algorithms 1, 4, and 5 is

  C_{1,4,5} ≤ (11K+1) N_Ω.

The cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 10d^-2 sweeps on the finest grid. This assumes that the number of physical boundary nodes is (N_Ω)^(1/2), which is quite pessimistic.


Physical boundaries known

If we assume that there are no more than ½ N_Ω physical boundary nodes per cache block, then we get a better bound: the cost of Algorithms 1, 4, and 5 with respect to Gauss-Seidel is at most 5.5d^-2 sweeps on the finest grid.
Clearly, with better estimates of the realistic number of physical boundary nodes (i.e., much less than ½ N_Ω) per cache block, we can reduce this bound.


Preprocessing costs

Pessimistic bounds (never seen to date in a real problem):
• d = 1: 5.5 Gauss-Seidel sweeps
• d = 2: 1.375 Gauss-Seidel sweeps
• d = 3: 0.611 Gauss-Seidel sweeps

In-practice bounds:
• d = 1: ~1 Gauss-Seidel sweep
• d = 2: ~½ Gauss-Seidel sweep


Numerical results: Austria

Experiment: Austria. Two-dimensional elasticity; T = Cauchy stress tensor, w = displacement.

  -∇·T = f in Ω,
  ∂w/∂n = 100w on Γ_1,
  ∂w/∂y = 100w on Γ_2,
  ∂w/∂x = 100w on Γ_3,
  ∂w/∂n = 0 everywhere else,

where f = (1,-1)^T on Γ_4, f = (9.5-x, 4-y) if (x,y) is in the region surrounded by Γ_5, and f = 0 otherwise.


Numerical experiments: Bavaria

Experiment: Bavaria (coarse grid mesh). Stationary heat equation with 7 sources and one sink (Munich Oktoberfest). Homogeneous Dirichlet boundary conditions on the Czech border (northeast), homogeneous Neumann boundary conditions everywhere else.


Code commentary

Preprocessing steps are separate.
Computation
• Standard algorithms implemented in as efficient a manner as known without the cache aware or active set tricks. This is not Randy Bank's version, but much, much faster.
• The cache aware implementation uses the quasi-unstructured information.
• The codes work equally well in 2D and 3D (the latter needs a larger cache than 2D) and are really general sparse matrix codes with a PDE front end.


Implementation details

One-parameter tuning: usable cache size, which is normally ~60% of the physical cache size.
Current code
• Fortran + C (strict spaghetti coding style and fast).
• Requires help from the authors to add a new domain and coarse grid.
New code
• C++.
• Should not require any author to be present.
• Is supposed to be available by the end of September and will probably be done earlier. Aiming for inclusion in ML (see software.sandia.gov).


Books
• W. Briggs, V.E. Henson, S.F. McCormick, A Multigrid Tutorial, SIAM, 2000.
• D. Bulka, D. Mayhew, Efficient C++, Addison-Wesley, 2000.
• C.C. Douglas, G. Haase, U. Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization, SIAM, 2003.
• S. Goedecker, A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.
• J. Handy, The Cache Memory Book, Academic Press, 1998.
• J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, 1996.
• U. Trottenberg, A. Schuller, C. Oosterlee, Multigrid, Academic Press, 2000.


Journal articles
• C.C. Douglas, Caching in with multigrid algorithms: problems in two dimensions, Parallel Algorithms and Applications, 9 (1996), pp. 195-204.
• C.C. Douglas, G. Haase, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Portable memory hierarchy techniques for PDE solvers, Part I, SIAM News 33/5 (2000), pp. 1, 8-9; Part II, SIAM News 33/6 (2000), pp. 1, 10-11, 16.
• C.C. Douglas, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiss, Cache optimization for structured and unstructured grid multigrid, Electronic Transactions on Numerical Analysis, 10 (2000), pp. 25-40.
• C. Weiss, M. Kowarschik, U. Rüde, W. Karl, Cache-aware multigrid methods for solving Poisson's equation in two dimensions, Computing, 64 (2000), pp. 381-399.


Conference proceedings
• M. Kowarschik, C. Weiss, DiMEPACK - a cache-optimized multigrid library, Proc. of the Intl. Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), vol. I, June 2001, pp. 425-430.
• U. Rüde, Iterative algorithms on high performance architectures, Proc. of the EuroPar97 Conference, Lecture Notes in Computer Science, Springer, Berlin, 1997, pp. 57-71.
• U. Rüde, Technological trends and their impact on the future of supercomputing, in H.-J. Bungartz, F. Durst, C. Zenger (eds.), High Performance Scientific and Engineering Computing, Lecture Notes in Computer Science, Vol. 8, Springer, 1998, pp. 459-471.


Conference proceedings
• C. Weiss, W. Karl, M. Kowarschik, U. Rüde, Memory characteristics of iterative methods, Proc. of the Supercomputing Conference, Portland, Oregon, November 1999.


Related websites
• http://www.mgnet.org
• http://www.mgnet.org/~douglas/ccd-preprints.html
• http://www.mgnet.org/~douglas/ml-dddas.html
• http://www10.informatik.uni-erlangen.de/dime
• http://valgrind.kde.org
• http://www.fz-juelich.de/zam/PCL
• http://icl.cs.utk.edu/projects/papi
• http://www.hipersoft.rice.edu/hpctoolkit
• http://www.tru64unix.compaq.com/dcpi
• http://ec-securehost.com/SIAM/SE16.html
