Runtime support for memory adaptation in scientific applications via ...
Runtime support for memory adaptation in scientific applications via ...
Runtime support for memory adaptation in scientific applications via ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Runtime support for memory adaptation in scientific applications via localdisk and remote memory ∗C. Yue † R. T. Mills ‡ A. Stathopoulos † D. S. Nikolopoulos †AbstractThe ever increasing memory demands of many scientificapplications and the complexity of today’s sharedcomputational resources still require the occasional useof virtual memory, network memory, or even out-of-coreimplementations, with well known drawbacks in performanceand usability. In this paper, we present a generalframework, based on our earlier MMLIB prototype ,that enables fully customizable, memory malleability in awide variety of scientific applications. We provide severalnecessary enhancements to the environment sensing capabilitiesof MMLIB and introduce a remote memory capability,based on MPI communication of cached memoryblocks between ‘compute nodes’ and designated memoryservers. We show experimental results from three importantscientific applications that require the general MM-LIB framework. Under constant memory pressure, weobserve execution time improvements of factors betweenthree and five over relying solely on the virtual memorysystem. With remote memory employed, these factors areeven larger and significantly better than other, systemlevelremote memory implementations.1 IntroductionCommoditization of memory chips has enabled unprecedentedincreases in the memory available on today’scomputers and at rapidly decreasing costs. Manufacturingand marketing factors, however, keep the costs dis-∗ Work supported by the National Science Foundation (ITR/ACS-0082094, ITR/AP-0112727, ITR/ACI-0312980, CAREER/CCF-0346867) the Department of Energy (DOE-FG-02-05ER2568), a DOEcomputational sciences graduate fellowship, and the College of Williamand Mary. Part of this work was performed using SciClone clusterat the College of William and Mary which were enabled by grantsfrom Sun Microsystems, the NSF, and Virginia’s CommonwealthTechnology Research Fund.† Department of Computer Science, College ofWilliam and Mary, Williamsburg, Virginia 23187-8795,(email@example.com).‡ National Center for Computational Sciences, Oak Ridge NationalLaboratory, Oak Ridge, TN (firstname.lastname@example.org)proportionally high for larger memory chips. Therefore,although many shared computational resource pools oreven large scale MPPs boast a large aggregate memory,only a small amount (relative to the high demands ofmany scientific applications) is available on individualprocessors. Moreover, available memory may vary temporallyunder multiprogramming. Sharing memory resourcesacross processors is a difficult problem, particularlywhen applications cannot reserve sufficient memoryfor sufficiently long times. These realities pose tremendousproblems in many memory demanding scientific applications.To quantify the magnitude of the problem, we use asimple motivating example. We ran a parallel multigridcode to compute a three-dimensional potential field onfour SMPs that our department maintains as a resourcepool for computationally demanding jobs. Each SMP has1 GB of memory. Since our code needed 860 MB perprocessor, we could run it using only one processor pernode. While our code was running other users could runsmall jobs without interference. When a user attemptedto launch Matlab to compute the QR decomposition of alarge matrix on one of the processors in an SMP, the timeper iteration in our multigrid code jumped from 14 to 472seconds as virtual memory system thrashed. Most virtualmemory systems would cause thrashing in this case, becausetheir page replacement policies are not well suitedfor this type of scientific application.Memory pressure can also be encountered on dedicatedor space-shared COWs and MPPs. An importantclass of scientific applications runs a large number of sequentialor low parallel degree jobs to explore a range ofparameters. Typically, larger numbers of processors enablehigher throughput computing. However, when memoryis insufficient for individual jobs, such applicationsgrind to a halt under virtual memory. On some MPPs,with small or no local disks, virtual memory is not adequateor even possible. Utilizing more processors per jobto exploit larger aggregate memory may not be possibleeither, because the jobs are sequential or of limited parallelscalability. Application scientists often address theseissues by implementing specialized out-of-core codes,and utilizing the parallel file systems that many MPPs1
employ. Besides the double performance penalty (dataread from disk and propagated through the interconnectionnetwork), such codes often lack flexibility, performingI/O even for data sets that could fit in memory.Scenarios such as the above are one of the reasons forusing batch schedulers, resource matchmakers, processmigration and several other techniques that warrant thateach application will have enough memory to run withoutthrashing throughout its lifetime. These methods arenot without problems. They may incur high waiting timesfor jobs, or high runtime migration overheads wheneverthere is resource contention. On certain platforms, suchas shared-memory multiprocessors, some of these methodsare not even applicable, as users may reserve morememory than their memory-per-CPU share . A similarsituation can occur if applications are allowed to usenetwork RAM (NRAM) or other remote memory mechanisms. Moreover, most remote memory researchhas been conducted at page-level granularity, which mayincur unacceptably high latencies for page replacement.The problem is equally hard at the sequential level.Page replacement policies of virtual memory systemsusually target applications such as transaction processingthat do not share the same memory access patterns asmost scientific applications. In addition, high seek timesduring thrashing cannot be amortized by prefetching, asit is unreasonable to expect the virtual memory system topredict the locality and pattern of block accesses on disk.On the other hand, compiler or user-provided hints wouldrequire modifications to the system.To tame these problems we have developed a runtimelibrary, MMLIB (Memory Malleability library), thatcontrols explicitly the DRAM allocations of specifiedlarge objects during the runtime of an application, thusenabling it to execute optimally under variable memoryavailability. MMLIB allows applications to use customized,application-specific memory management policies,running entirely at user-level. To achieve portabilityand performance, MMLIB blocks specified memoryobjects into panels and manages them through memorymapping. This provides programmers with a familiar interfacethat has no explicit I/O and can exploit the samecommon abstractions used to optimize code for memoryhierarchies. Moreover, the advantages of runningapplications fully in-core are maintained, when enoughmemory is available. The library is designed for portabilityacross operating systems and implemented in a nonintrusivemanner.In , we gave a proof of concept of MMLIB basedon a simplified framework, and developed a parameterfreealgorithm to accurately ascertain memory shortageand availability strictly at user-level. In this paper, wefirst provide a general framework that enables memorymalleability in a variety of scientific applications, and enhanceMMLIB’s sensing capabilities to require no userinput. Second, we introduce remote memory capabilityinto MMLIB, based on MPI communication of panelsbetween compute nodes and designated memory servers.Besides performance improvements on clusters with highspeed networks, our flexible, user-level design enables ahost of options such as multiple memory servers, and dynamicmigration of panels between servers.We see several benefits in this library-based approachfor memory malleability. Injecting memory malleabilityinto scientific applications allows them to run undermemory pressure with degraded but acceptable efficiency,under a wider variety of execution conditions.Efficient memory adaptive applications can benefit bothhigh-level batch schedulers, by letting them harness cyclesfrom busy machines with idle memory, and operatingsystem schedulers, by avoiding thrashing and thus securingcontinuous service to jobs. Also, as we show in thispaper, the transparent design enables the library to implementremote memory when local disks are small orslower than fetching data from the network. MMLIB isan optimization tool for DRAM accesses at the low levelof the memory hierarchy (disk/network), but it can coexistwith and complement optimization tools for higherlevels (memory/cache), enabling a unified approach to localityoptimization in block-structured codes.We present experimental results from three importantscientific applications that we linked with MMLIB. Besidestheir importance for scientific computing, these applicationsstress different aspects of MMLIB and motivatethe runtime optimizations presented in this work.2 Related workInteresting memory usage and availability patternson multiprogrammed clusters of workstations has beenpointed out in a quantitative study by Acharya and Setia. They show that, on average, more than half of thememory of the clusters is available for intervals between5 and 30 minutes, with shorter intervals for larger memoryrequests. The study did not investigate mechanismsand policies for exploiting idle memory or the impact offluctuations of idle memory on application performance.Batch schedulers for privately owned networks ofworkstations and grids, such as Condor-G  and theGrADS scheduling framework , use admission controlschemes which schedule a job only on nodes withenough memory. This avoids thrashing at the cost of reducedutilization of memory and potentially higher jobwaiting times. Other coarse-grain approaches for avoidingthrashing include checkpointing and migration ofjobs. However, such approaches are not generally awareof the performance characteristics or the execution state
of the program .Chang et.al.  have presented a user-level mechanismfor constraining the resident set sizes of user programswithin a fixed range of available memory. Theyassume that the program has an a-priori knowledge oflower and upper bounds of the required memory range.This work does not consider dynamic changes to memoryavailability, nor does it address the problem of customizingthe memory allocation and replacement policy to thememory access pattern of the application—both centralissues in our research.An approach that addresses a dynamically changingenvironment for general applications has been developedby Brown and Mowry . This approach integratescompiler analysis, operating system support, and a runtimelayer to enable memory-intensive applications to effectivelyuse paged virtual memory. The runtime layermakes the appropriate memory allocation decisions byprocessing hints on the predicted memory usage that thecompiler inserted. Although the approach has shownsome good results, it requires modifications to the operatingsystem. In addition, applications with complexmemory access patterns can cause significant difficultiesin identifying appropriate release points.Barve and Vitter  presented a theoretical frameworkfor estimating the optimal performance that algorithmscould achieve if they adapted to varying amountsof available memory. They did not discuss implementationdetails or how system adaptivity can be achieved.Pang et al.  presented a sorting algorithm that dynamicallysplits and merges the resident buffer to adaptto changes in the memory available to the DBMS. This isa simulation-based study that does not discuss any detailsof the adaptation interface.Remote memory servers have been employed in multicomputers for jobs that exceed the available memoryper processor. They also enabled the implementationof diskless checkpointing , a fault-tolerance schemewhich exploits the speed of the interconnection networkto accelerate saving and restoring of program state.The advent of high-throughput computing on sharedcomputational resources has motivated the design ofNRAM systems for clusters of workstations [17, 4, 5,12, 26]. Real implementations of NRAM and memoryservers [5, 15, 11, 31] extend the operating system pagingalgorithms and provide support for consistency andfault tolerance at the page level. Though performanceimprovements have been reported over disk-based virtualmemory systems, the page level fine granularity stillincurs significant overheads, and thrashing can still occur.Moreover, such implementations require substantialchanges to the operating system.Several researchers have utilized remote memory forimplementing co-operative caching and prefetching inclustered web servers. Recently, Narravula et al. , exploitremote memory and direct remote DMA operationsto improve utilization of aggregate distributed caches ina co-operative caching envronment. Our work differsin that it employs remote memory in an applicationcontrolledexecution environment.Koussih et al.  describe a user-level, remotememory management system called Dodo. Based ona Condor-like philosophy, Dodo harvests unused memoryfrom idle workstations. It provides allocation andmanagement of fine-grained remote memory objects, aswell as user-defined replacement policies, but it does notinclude adaptation mechanisms for dynamically varyingRAM, and cannot apply to local disks, or to remote memoryservers that are not idle.In , Nikolopoulos presented an adaptive schedulingscheme for alleviating memory pressure on multiprogrammedCOWs, while co-ordinating the scheduling ofthe communicating threads of parallel jobs. That schemerequired modifications to the operating system. In ,the same author suggested the use of dynamic memorymapping for controlling the resident set of a programwithin a range of available physical memory. The algorithmoperated at page-level granularity. In , twoof the authors of this paper followed an application-levelapproach that avoided thrashing of an eigenvalue solver,by having the node under memory pressure recede itscomputation during that most computationally intensivephase, hopefully speeding the completion of competingjobs.MMLIB bares similarities with SHMOD (Shared-Memory on Disk) , an application-level asynchronousremote I/O library, which enables effectiveremote disk space usage, continuous application-levelcheckpointing and out-of-core execution. SHMOD is designedto support specifically a class of hydrodynamicsapplications and uses coarse-grain panels (called thingsin SHMOD’s terminology), allocated and managed transparentlyacross local and remote disks in a cluster ofworkstations. MMLIB differs in that it utilizes remoteDRAM instead of remote disks for faster retrieval of panels.3 User-level adaptationApplication-level approaches are sometimes receivedwith caution because of increased developer involvement.However, to exploit higher memory hierarchies, developersof scientific applications already block the accessesto the data or use libraries where such careful blocking isprovided. Blocking for memory access is performed at amuch larger granularity and thus complementary to cacheaccess. Based on this, our approach in  consideredthe largest data object partitioned into P blocks, which
in out-of-core literature are often called panels, and operatedas follows:for i = 1:PGet panel p pattern(i) from lower level memoryWork with p pattern(i)Most scientific applications consist of code segments thatcan be described in this familiar to developers format. Aslong as the “get panel” encapsulates the memory managementfunctionality, no code restructuring is ever needed.On a dedicated workstation one can easily select betweenan in-core or an out-of-core algorithm and datastructure according to the size of the problem. On a nondedicatedsystem though, the algorithm should adapt tomemory variability, running as fast as an in-core algorithmif there is enough memory to store its entire dataset, or utilizing the available memory to cache as manypanels as possible. Based on memory mapped I/O, weprovided a framework and supporting library for modifyingcodes for memory adaptivity that are portable tomany applications and operating systems. Memory mappinghas several advantages over conventional I/O becauseit avoids write-outs to swap space of read-only panels,integrates data access from memory and disk, allowsfor fine tuning of the panel size to hide disk latencies andfacilitates an implementation of various cache replacementpolicies.To get a new panel, the application calls a functionfrom our library that also controls the number of panelskept in-core. At this point, the function has three choices:it can increase or decrease the number of in-core panels ifadditional, or respectively less, memory is available; or itcan sustain the number of in-core panels if no change inmemory availability is detected. The policy for selectingpanels to evict is user defined as only the application hasfull knowledge of the access pattern.Critical to this functionality is that our library be ableto detect memory shortage and availability. However, theamount of total available memory is a global system informationthat few systems provide, and even then, it isexpensive and with no consistent semantics. In , wedeveloped an algorithm which relies only on measurementsof the program’s resident set size (RSS), a widelyportable and local information. Memory shortage is inferredfrom a decrease in a program’s RSS that occurswithout any unmapping on the part of the program. Memorysurplus is detected using a “probing” approach inwhich the availability of a quantity of memory is determinedby attepmting to use it and seeing if it can be maintainedin the resident set. The algorithm is parameterfree,expecting only an estimate of the memory requirementsof the program, excluding the managed panels. Wecall the size of this non-managed memory, static memory(sRSS). The algorithm detects memory availability,Identify memory objects M 1, M 2, . . . , M kneeded during this phasefor i = [ Iteration Space for all Objects ]for j= [ all Objects needed for iteration i ]panelID = accessPattern(M j, i)Get panel or portion of panel (panelID)endforWork on required panels or subpanelsfor j= [ all Objects needed for iteration i ]panelID = accessPattern(M j, i)if panel panelID was modifiedWrite Back(panelID)if panel panelID not needed persistentlyRelease(panelID)endforendforFigure 1. Extended framework modelingthe memory access needs of a wide varietyof scientific applications.by probing the system at dynamically selected time intervals,attempting to increase memory usage one panel at atime. For the effectiveness of this algorithm in identifyingthe available memory, the reader is referred to .3.1 A general frameworkThe above simplified framework is limited to a classof applications with repeated, exhaustive passes throughone, read-only object. We have developed a more comprehensiveframework that captures characteristics froma much larger class of applications. Scientific applicationsoften work on many large memory objects at a time,with various access patterns for each object, sometimesworking persistently on one panel, while other panels areonly partially accessed, or even modified. This extendedframework of applications is shown in Figure 1.Figure 1 depicts only one computation phase, which isrepeated several times during the lifetime of the program.A computation phase denotes a thematic beginning andend of some computation, e.g., one iteration of the CGmethod or the two innermost of three nested loops. In thisphase, a small number of memory objects are accessed(e.g., the matrix and the preconditioner in the CG algorithm),as their sheer size limits their number. In contrastto the previous simplified framework, we do not assumea sequential pass through all the panels of an object, althoughthis is often the case in practice. In this context, afull sweep denotes a completion of the phase.For each iteration of the computation phase, certainpanels from certain memory objects need to be fetched,worked upon, and possibly written back. The iterationspace in the current computation phase, the objects
needed for the current iteration, and the access patternsfor panels depend on the algorithm and can be describedby the programmer. Finally, our new framework allowsmemory objects to fluctuate in size between differentcomputation phases.We have extended MMLIB to implement this generalframework, hiding all bookkeeping and adaptation decisionsfrom the user. The user interface that enables thistransparency and a detailed discussion of the technical issuesinvolved are described in detail in .3.2 Estimating static memory sizeOur memory adaptation algorithm in  assumesthat the program has an accurate estimate of the size of itsstatic memory, i.e., memory not associated with managedobjects. This is needed for calculating how much of theRSS belongs to the mapped objects. However, this sizemay not be easily computed or even available to the programif large enough static memory is allocated withinlinked, third party libraries. Moreover, for some programsthe static memory may fluctuate between sweepsof the computation phase. A more serious problem ariseswhen the static memory is not accessed during the computationphase. Under memory pressure, most operatingsystems consider the static memory least recently usedand slowly swap it out of DRAM. This causes a falsedetection of memory shortage, and the unmapping of asmany panels as the size of the swapped static memory.An elegant solution to this problem relies on a systemcall named mincore() for most Unix systems andVirtualQuery() for Windows. The call allows aprocess to obtain information about which pages from acertain memory segment are resident in core. BecauseMMLIB can only ascertain residency of its MMS objects,it uses mincore() to compute the actual residentsize of the MMS panels, which is exactly what our algorithmneeds. Obviously, the use of this technique at everyget panel is prohibitive because of the overheadassociated with checking the pages of all mapped panels.We follow a more feasible, yet equally effective strategy.We measure the residency of all panels (mRSS) in thebeginning of a new computational phase, and derive anaccurate estimate of the static memory: sRSS = RSS −mRSS, with RSS obtained from the system. As long asno memory shortage is detected, we use sRSS during thecomputation phase. Otherwise, we recompute mRSS andsRSS to make sure we do not unnecessarily unmap panels.Since unmapping is not a frequent occurrence theoverall mincore overhead is tiny, especially comparedto the slow down the code experiences when unmappingis required.3.3 A most-missing eviction policyOne of the design goals of MMLIB is to preempt thevirtual memory system paging policy by holding the RSSof the application below the level at which the systemwill begin to swap out the pages of the application. Underincreasing memory pressure, the paging algorithmof the system could be antagonistic by paging out datathat MMLIB tries to keep resident, thus causing unnecessaryadditional memory shortage. In this case, it maybe beneficial to “concede defeat” and limit our losses byevicting those panels that have had most of their pagesswapped out, rather than evicting according to our policy,say MRU. The rationale is that if the OS has evictedLRU pages, these will have to be reloaded either way, sowe might as well evict the corresponding panels. EvictingMRU panels may make things worse because we willhave to load the swapped out LRU pages as well as theMRU panels that we evicted.The mincore functionality we described above facilitatesthe implementation of this “most missing” policy.This policy is not at odds with the user specified policybecause it is only applied when memory shortage is detected,which is when the antagonism with the systempolicy can occur. Under constant or increasing memoryavailability the user policy is in effect. Preliminary resultsin section 5 show clear advantages with this policy.4 Remote memory extensionDespite a dramatic increase in disk storage capacities,improvements in disk latencies have lagged significantlybehind those of interconnection networks. Hence, remotevirtual memory has often been suggested . The argumentis strengthened by work in [3, 2, 16] showing thatthere is significant benefit to harvesting the ample idlememory in computing clusters for data-intensive applications.The argument is imposing on MPPs, where parallelI/O must pass also through the interconnection network.The general MMLIB framework in Figure 1 lends itselfnaturally to a remote memory extension. The keymodification is that instead of memory mapping a newpanel from the corresponding file on disk, we allocatespace for it and request it from a remote memory server.This server stores and keeps track of unmapped panelsin its memory, while handling the mapping requests. Inimplementing this extension we had to address severaldesign issues.First, we chose MPI for the communication betweenprocessors, because it is a widely portable interface thatusers are also familiar with. Also, because MMLIBworks entirely at user-level, we need the user to be ableto designate which processors will play the role of remotememory servers. This is deliberately unlike the Dodo or
our experiment, we use one panel per vector for a totalof 30 vectors, each of 3,500,000 doubles (80 MB total).This test is possible only through our new MMLIB, asthe size of the MMS object varies at runtime, and multiplepanels are active simulateneously.The third application is an implementation of the Isingmodel, which is used to model magnetism in ferromagneticmaterial, liquid-gas transition, and other similarphysical processes. Considering a two-dimensional latticeof atoms, each of which has a spin of either up ordown, the code runs a Monte-Carlo simulation to generatea series of configurations that represent thermal equilibrium.The memory accesses are based on a simple 5-point stencil computation. For each iteration the codesweeps the lattice and tests whether to flip the spin ofeach lattice site. The flip is accepted, if it causes a negativepotential energy change, ∆E. Otherwise, the spinflips with a probability equal to exp −∆EkT, where k is theBoltzman constant and T the ambient temperature. Thehigher the T , the more spins are flipped. In computationalterms, T determines the frequency of writes to thelattice sites at every iteration. The memory-adaptive versionpartitions the lattice row-wise into 40 panels. Tocalculate the energy change at panel boundaries, the codeneeds the last row of the above neighboring panel andthe first row of the below neighboring panel. The newMMLIB framework is needed for panel write backs withvariable frequency, and multiple active panels.In all three applications the panel replacement policyis MRU, but there is also use of persistent (MGS) andneighboring (Ising) panels.5.2 Adaptation resultsFigure 2 includes one graph per application, eachshowing the performance of three versions of that applicationunder constant memory pressure. Each pointin the charts shows the execution time of one versionof the application when run against a dummy job thatoccupies a fixed amount of physical memory, using themlock() system call to pin its pages in-core. For eachapplication we test a conventional in-core version (bluetop curve), a memory-adaptive version using MMLIB(red lower curve), and an ideal version (green lowestcurve) in which the application fixes the number of panelscached at an optimal value provided by an oracle. Thecharts show the performance degradation of the applicationsunder increasing levels of memory pressure. Inall three applications, the memory-adaptive implementationperforms consistently and significantly better thanthe conventional in-core implementation. Additionally,the performance of the adaptive code is very close to theideal-case performance, without any advance knowledgeof memory availability and static memory size, and re-Average time (seconds) per CG iterationAverage MGS time (seconds)Average time (seconds) per Ising sweep403530252015105Average CG iteration time vs. memory pressureMemory-adaptivepanel_in fixed at optimal valueIn-core010 20 30 40 50 60 70 8040035030025020015010060504030201050Size of locked region (MB)Average MGS time for last 10 vectors vs. memory pressureMemory-adaptivepanels_in fixed at optimal valueIn-core00 10 20 30 40 50 60 70 80Size of locked region (MB)Average Ising sweep time vs. memory pressureMemory-adaptivepanels_in fixed at optimal valueIn-core010 20 30 40 50 60 70 80 90Size of locked region (MB)Figure 2. Performance under constant memorypressure. The left chart shows the averagetime per iteration of CG with a 70 MB matrix,which requires a total of 81 MB of RAM includingmemory for the work vectors. The middlechart shows the time to orthogonalize via modifiedGram-Schmidt the last 10 vectors of a 30vector set. Approximately 80 MB are requiredto store all 30 vectors. The right chart showsthe time required for an Ising code to sweepthrough a 70 MB lattice. All experiments wereconducted on a Linux 2.4.22-xfs system with128 MB of RAM, some of which is shared withthe video subsystem.
gardless of the number of active panels (whether readonlyor read/write).2422Performance with evict_mru_greedy() vs. evict_most_missing()evict_mru_greedy()evict_most_missing()20Resident set size (KB)1600001400001200001000008000060000ising_mem: 150 MB versus 150 MBAverage time (seconds) for iteration18161412108644000020 1 2 3 4 5 6 7 8 9Iteration number20000150 MB150 MB00 200 400 600 800 1000 1200 1400 1600 1800 2000Time elapsed (seconds)Figure 3. Running adaptive versus adaptivejobs. RSS vs. time for two equal-sized (150 MB)memory-adaptive Ising jobs. The second jobstarts 30 seconds after the first job. The experimentis run on a Sun Ultra 5 with 256 MB ofRAM, under Solaris 9.Figure 4. Benefits of evicting partially swappedout panels. The chart shows the times forthe first 10 sweeps of a memory-adaptive Isingjob running with a 60 MB lattice on a Linuxmachine with 128 MB of RAM. After 12 seconds,a dummy job that uses 60 MB of memorybegins. The job that evicts panels withthe largest number of missing pages, labeledevict most missing, has lower and less variabletimes than the original MRU replacement,labeled evict mru greedy.The litmus test for MMLIB is when multiple instancesof applications employing the library are able to coexiston a machine without thrashing the memory system. Figure3 shows the resident set size (RSS) over time for twoinstances of the memory adaptive Ising code running simultaneouslyon a Sun Ultra 5 node. After the job thatstarts first has completed at least one sweep through thelattice, the second job starts. Both jobs have 150 MBrequirements, but memory pressure varies temporally.The circles in the curves denote the beginning of latticesweeps. Distances between consecutive circles along acurve indicate the time of each sweep.The results show that the two adaptive codes run togetherharmoniously without constantly evicting eachother from memory and the jobs reach an equilibriumwhere system memory utilization is high and throughputis sustained without thrashing. The system does not allowmore than about 170 MB for all jobs, and it tends to favorthe application that starts first. A similar phenomenonwas observed in Linux. We emphasize that the intricaciesof the memory allocation policy of the OS are orthogonalto the policies of MMLIB. MMLIB allows jobs to utilizeas efficiently as possible the available memory, notto claim more memory than what is given to each applicationby the OS.Figure 4 shows the benefits of our proposed “mostmissing”eviction policy. After external memory pressurestarts, the job that uses strictly the MRU policy has highlyvariable performance initially because the evicted MRUpanels usually do not coincide with the system evictedpages. The most missing policy is much better duringthat phase. After the algorithm adapts to the availablememory the most missing policy reverts automatically toMRU.Because all three of the applications we have testedemploy memory access patterns for which MRU replacementis appropriate (and are thus very poorly served bythe LRU-like algorithms employed by virtual memorysystems), one might wonder if all of the performancegains provided by MMLIB are attributable to the MRUreplacement that it enables. To test this notion, we ranthe MMLIB-enabled CG under memory pressure and instructedMMLIB to use LRU replacement. Figure 5 comparesthe performance of CG using the wrong (LRU) replacementpolicy with the CG correctly employing MRU.The version using LRU replacement performs significantlyworse, requiring roughly twice the amount of timerequired by the MRU version to perform one iteration.However, when compared to the performance of in-coreCG under memory pressure, the code using the wrong replacementpolicy still performs iterations in roughly halfthe time of in-core CG at lower levels of memory pressure,and at higher levels performs even better. As discussedin the designing goals of MMLIB, the benefits inthis case come strictly from the large granularity of panelaccess, as opposed to the page-level granularity of thevirtual memory system, and from avoiding thrashing.
4035MMLIB-CG-LRU-wrongMMLIB-CG-MRU-correctCG-incorePerformance of MMLIB-CG using wrong access patternAverage time for CG iteration (sec)30252015105010 20 30 40 50 60 70 80Size of locked region (MB)Figure 5. Performance of memory-adaptive CGversus memory pressure when using the wrongpanel replacement policy, which is depicted bythe curve labeled MMLIB-CG-LRU-wrong. Theappropriate replacement policy is MRU, but LRUis used instead; any observed performancegains are not due to the ability of MMLIB to allowapplication-specific replacement. For comparison,the performance of memory-adaptive correctlyemploying MRU is depicted by the curvelabeled MMLIB-CG-MRU-correct. Jobs run understatic memory pressure provided by lockinga region of memory in-core on a Linux 2.4 systemwith 128 MB of RAM.Bandwidth (Mbps)Bandwidth (Mbps)90020M buffer800 50M buffer100M buffer700150M buffer200M buffer600250M buffer500400300200100064 128 256 512 1024 2048 4096 8192 1638490020M buffer800 50M buffer100M buffer700150M buffer200M buffer600250M buffer500400300200100Block Size (K bytes)5.3 Remote memory resultsThe remote memory experiments are conducted on theSciClone cluster  at William & Mary. All programsare linked with the MPICH GM package and the communicationsare routed via a Myrinet 1280 switch. Weuse dual-cpu Sun Ultra 60 workstations at 360MHz with512MB memory, running Solaris 220.127.116.11 Microbenchmark resultsThese experiments help us understand the effect of variousallocation schemes (malloc, named mmap, anonymousmmap) on the MPI Recv performance, undervarious levels of memory pressure. Figure 6 showsthree graphs corresponding to the three methods forallocating the receiving block of the MPI Recv call.Each graph contains six curves plotting the perceivedMPI Recv bandwidth for six different total ‘buffers’ thatthe MPI Recv tries to fill by receiving ‘Block Size’ bytesat a time. This microbenchmark simulates an MMLIBprocess that uses remote memory to bring each one ofthe panels (of ‘Block Size’ each) of a memory object (of‘buffer’ size). On the same node with the receiving process,there is a competing process reading a 300MB fileBandwidth (Mbps)064 128 256 512 1024 2048 4096 8192 16384900800700600500400300200100Block Size (K bytes)250M buffer064 128 256 512 1024 2048 4096 8192 16384Block Size (K bytes)20M buffer50M buffer100M buffer150M buffer200M bufferFigure 6. MPI Recv perceived bandwidth microbenchmarkresults. The charts show the perceivedMPI Recv bandwidth vs. receiving blocksize for six different total buffer sizes. The receivingblock is allocated using named mmap()(top chart), anonymous mmap() (middle chart),or malloc() (bottom chart).
from the local disk. The larger the total ‘buffer’ size, themore severe the memory pressure on the node. The remotememory server process is always ready to send therequested data.The top chart suggests that named mapping is theworst choice for allocating MPI Recv buffers, especiallyunder heavy memory pressure, which is exactly whenMMLIB is needed. In fact when the receiving block issmall, performance is bad even without memory pressure(e.g., all curves for block size of 256 KB). With largeblock sizes bandwidth increases but only when the total‘buffer’ does not cause memory pressure (e.g., the 20MB‘buffer’ curve). A reason for this is that receiving a remotepanel causes a write-out to its backing store on thelocal disk, even though the two may be identical.For both the middle and bottom charts, all six curvesare very similar, suggesting that allocation using anonymousmmap() or malloc() is not very sensitive to memorypressure. The MPI Recv bandwidth performance of usingmalloc() for memory allocation is better than that ofusing anonymous mmap(). As no other processes wereusing the network during the experiments, the perceiveddifferences in bandwidth must be due to different mechanismsof Solaris 9 for copying memory from system touser space.Although these microbenchmarks suggest malloc() asthe allocator of choice for implementing remote memory,experiments with CG favor anonymous mapping, withmalloc() demonstrating unpredictable behavior.5.3.2 CG application resultsTo demonstrate the remote memory capability of MM-LIB, we performed two sets of experiments with the CGapplication on the same computing platform as in microbenchmarkexperiments.In the first set, the MMLIB enabled CG applicationruns on one local node. It works on a 200MB matrix,which is equally partitioned into 20 panels. We createvarious levels of memory pressure on the local node byrunning in addition the in-core CG application (withoutMMLIB) with 300MB, 150MB, and 100MB memory requirements.The experimental results are shown in Table1. There are two rows of data for each of the threememory allocation methods. In the first row, we run theMMLIB CG without remote memory; in the second row,we run MMLIB CG with remote memory capability.First, we see that under named mmap(), performancefor remote mode is inferior to local mode, while underanonymous mmap() mode and malloc(), remote mode isobviously superior to local mode, especially when memorypressure is severe. The results also confirm the microbenchmarkobservations that named mmap() is thewrong choice for remote memory.In contrast to the microbenchmark results, however,remote mode performance is better under anonymousmmap() than under malloc(). There are two reasons forthis. The most important reason is that on Solaris 9 malloc()extends the data segment by calling brk(), ratherthan by calling mmap(). Because MMLIB issues a seriesof malloc() and free() calls, especially when there isheavy memory pressure, the unmapped panels may notbe readily available for use by the system. This causesthe runtime scheduler to think that there is not enoughmemory and thus to allocate less resources to the executingprocess. In our experiments on the dual cpu Suns,we noticed that the Solaris scheduler gave less than 25%cpu time to the application in malloc() mode, while itgave close to 50% cpu time to the one in anonymousmmap() mode. Interestingly, the local MMLIB implementationwithout remote memory demonstrates even aworse behavior, suggesting that, on Solaris, the use ofmalloced segments should be avoided for highly dynamicI/O cases. The second reason is that the mincore()system call does not work on memory segments allocatedby malloc(). Therefore, MMLIB cannot obtain accurateestimates of the static memory to adapt to memory variabilities.In the second set of the experiments, we let two MM-LIB CG applications run against each other to measurethe performance advantages of using remote memoryover local disk in a completely dynamic setting. BothMMLIB CG applications use anonymous mmap() to allocatememory and work on different matrices of equalsize. Each matrix is equally partitioned into 10MB panels.The experimental results are shown in Table 2. Whenremote memory instead of local disk is used, the totalwall-clock time of the two MMLIB CG applications isalways reduced. Especially when the overall memorypressure is severe such as the 300MB matrix and 250MBmatrix cases (the overall memory requirement for data is600MB and 500MB), the wall-clock time reduction canbe 22.4% and 28.7% respectively.We emphasize that these improvements are on top ofthe improvements provided by the local disk MMLIBover the simple use of virtual memory. Considering alsothe improvements from Figure 2, our remote memory libraryimproves local virtual memory performance by afactor of between four and seven. This compares favorablywith factors of two or three reported in other remotememory research .6 ConclusionsWe presented a general framework and supporting librarythat allows scientific applications to automaticallymanage their memory requirements at runtime, thus executingoptimally under variable memory availability.
Wall-clock time for MMLIB CG running against in-core CGmemory pressure300MB 150MB 100MB Without Pressure❵❵❵❵❵❵❵❵❵❵❵modenamed mmap() local 388.097 326.642 309.365 285.135named mmap() remote 484.34 472.831 371.277 289.617anonymous mmap() local 541.005 367.357 317.726 294.406anonymous mmap() remote 379.114 360.516 317.756 293.213malloc() local 1325.485 1023.782 1059.640 1050.507malloc() remote 534.229 504.043 475.662 462.117Table 1. This table shows the wall-clock time for MMLIB CG running against in-core CG, in six modescorresponding to the six rows in the table. CG works on a 200MB matrix on local node. Memory pressureis created by in-core CG running on local node. Without Pressure means in-core CG is not running.Wall-clock time for MMLIB CG running against MMLIB CGmatrix size300MB 250MB 200MB 150MB 100MB❵❵❵❵❵❵❵❵❵❵❵processescg1 local 1496.700 1116.355 413.500 265.929 181.252cg2 local 1015.495 755.859 638.448 266.697 185.796Total local 2512.195 1872.214 1051.948 532.626 367.048cg1 remote 1139.011 815.240 415.056 266.049 158.954cg2 remote 809.446 519.031 603.588 252.822 157.289Total remote 1948.457 1334.271 1018.644 518.871 316.243Remote time reduction over local 22.4% 28.7% 3.16% 2.58% 13.8%Table 2. This table shows the wall-clock time for MMLIB CG running against MMLIB CG. Both MMLIB CGapplications use anonymous mmap() to allocate memory. The first three rows show the results when localdisk is used by both MMLIB CG applications. The following three rows show the results when remotememory is used by both MMLIB CG applications. The last row shows the total wall-clock time reduction forremote over local mode.The library is highly transparent, requiring minimal codemodifications and only at a large granularity level.This paper extends our previous simplified frameworkand adaptation algorithm for memory malleability withthe following key functionalities: (a) multiple and simultaneousread/write memory objects, active panels, and accesspatterns, (b) automatic and accurate estimation ofthe size of the non-managed memory, and (c) applicationlevel remote memory capability.We showed how each of the new functionalities (a)and (b) were necessary in implementing three commonscientific applications, and how (c) has signigicant performanceadvantages over system-level remote memory.Moreover, the remote memory functionality has opened ahost of new possibilities for load and memory balancingin COWs and MPPs, that will be explored in future research.Our experimental results with MMLIB, have confirmedits adaptivity and near optimality in performance.References SciClone cluster project at the College of William andMary. 2005. A. Acharya, G. Edjlali, and J. Saltz. The utility of exploitingidle workstations for parallel computation. volume 25,pages 225–234, 1997. A. Acharya and S. Setia. Availability and Utility of IdleMemory in Workstation Clusters. In Proc. of the 1999ACM SIGMETRICS Joint International Conference onMeasurement and Modeling of Computer Systems (SIG-METRICS’99), pages 35–46, Atlanta, Georgia, May 1999. A. Acharya, M. Uysal, R. Bennett, A. Mendelson,M. Beynon, J. Hollingsworth, J. Saltz, and A. Sussman.Tuning the Performance of I/O Intensive Parallel Applications.In Proc. of the Fourth Workshop on Input/Outputin Parallel and Distributed Systems (IOPADS’96), pages15–27, Philadelphia, PA, May 1996. A. Barak and A. Braverman. Memory Ushering in a ScalableComputing Cluster. Journal of Microprocessors andMicrosystems, 22(3–4):175–182, Aug. 1998. R. D. Barve and J. S. Vitter. A theoretical frameworkfor memory-adaptive algorithms. In IEEE Symposium onFoundations of Computer Science, pages 273–284, 1999. A. D. Brown and T. C. Mowry. Taming the memoryhogs: Using compiler-inserted releases to manage phys-
ical memory intelligently. In Proceedings of the 4th Symposiumon Operating Systems Design and Implementation(OSDI-00), pages 31–44, 2000. F. Chang, A. Itzkovitz, and V. Karamcheti. User-LevelResource Constrained Sandboxing. In Proc. of the 4thUSENIX Windows Systems Symposium, pages 25–36,Seattle, WA, Aug. 2000. S. Chiang and M. Vernon. Characteristics of a LargeShared Memory Production Workload. In Proc. 7th Workshopon Job Scheduling Strategies for Parallel Processing(JSSPP’2001), Lecture Notes in Computer Science, Vol.2221, pages 159–187, Cambridge, MA, June. H. Dail, H. Casanova, and F. Berman. A DecoupledScheduling Approach for the GrADS Program DevelopmentEnvironment. In Proc. of the IEEE/ACM Supercomputing’02:High Performance Networking and ComputingConference (SC’02), Baltimore, MD, Nov. 2002. M. Feeley, W. Morgan, F. Pighin, A. Karlin, H. Levy, ,and C. Thekkat. Implementing global memory managementin a workstation cluster. In 15th ACM Symposiumon Operating Systems Principles(SOSP-15), pages 201–212, 1995. M. Flouris and E. Markatos. Network RAM. In HighPerformance Cluster Computing, pages 383–408. PrenticeHall, 1999. J. Frey, T. Tannenbaum, M. Livny, I. Foster, andS. Tuecke. Condor-G: A Computation ManagementAgent for Multi-Institutional Grids. In Proc. of the 10thIEEE International Symposium on High Performance DistributedComputing (HPDC-10), pages 55–63, San Francisco,California, Aug. 2001. L. Iftode. Home-Based Shared Virtual Memory. PhD thesis,Princeton University, June 1998. L. Iftode, K. Petersen, and K. Li. Memory Serversfor Multicomputers. In Proc. of the IEEE 1993 SpringConference on Computers and Communications (COMP-CON’93), pages 538–547, Feb. 1993. S. Koussih, A. Acharya, and S. Setia. Dodo: A user-levelsystem for exploiting idle memory in workstation clusters.In HPDC, 1999. E. P. Markatos and G. Dramitinos. Implementation of areliable remote memory pager. In USENIX Annual TechnicalConference, pages 177–190, 1996. S. Narravula, H. W. Jin, K. Vaidyanathan and D. K. Panda.Designing Efficient Cooperative Caching Schemes forMulti-Tier Data Centers over RDMA-enabled Networks.In Proc. of the 6th IEEE/ACM International Symposiumon Cluster Computing and the Grid, Singapore, May2006. J. S. Plank, K. Li and M. Puening. Diskless Checkpointing.IEEE Transactions on Parallel and Distributed Systems,9(10), pp. 972–986, October 1998. R. T. Mills. Dynamic adaptation to cpu and memoryload in scientific applications. PhD thesis, Departmentof Computer Science, College of William and Mary, Fall,2004. R. T. Mills, A. Stathopoulos, and D. S. Nikolopoulos.Adapting to memory pressure from within scientific applicationson multiprogrammed COWs. In InternationalParallel and Distributed Processing Symposium (IPDPS2004), Santq Fe, NM, USA, 2004. R. T. Mills, A. Stathopoulos, and E. Smirni. Algorithmicmodifications to the Jacobi-Davidson parallel eigensolverto dynamically balance external CPU and memoryload. In 2001 International Conference on Supercomputing,pages 454–463. ACM Press, 2001. J. Nieplocha, M. Krishnan, B. Palmer, V. Tipparaju, andY. Zhang. Exploiting processor groups to extend scalabilityof the ga shared memory programming model. In ACMComputing Frontiers, Italy, 2005. D. Nikolopoulos. Malleable Memory Mapping: User-Level Control of Memory Bounds for Effective ProgramAdaptation. In Proc. of the 17th IEEE/ACM InternationalParallel and Distributed Processing Symposium(IPDPS’03), Nice, France, Apr. 2003. D. Nikolopoulos and C. Polychronopoulos. AdaptiveScheduling under Memory Pressure on MultiprogrammedClusters. In Proc. of the 2nd IEEE/ACM InternationalConference on Cluster Computing and the Grid (cc-Grid’02), pages 22–29, Berlin, Germany, May 2002. J. Oleszkiewicz, L. Xiao, and Y. Liu. Parallel networkram: Effectively utilizing global cluster memory for largedata-intensive parallel programs. In 2004 InternationalConference on Parallel Processing (ICPP’2004), pages353–360, 2004. H. Pang, M. J. Carey, and M. Livny. Memory-adaptive externalsorting. In R. Agrawal, S. Baker, and D. A. Bell, editors,19th International Conference on Very Large DataBases, August 24-27, 1993, Dublin, Ireland, Proceedings,pages 618–629. Morgan Kaufmann, 1993. Y. Saad. SPARSKIT: A basic toolkit for sparse matrixcomputations. Technical Report 90-20, Research Institutefor Advanced Computer Science, NASA Ames ResearchCenter, Moffet Field, CA, 1990. Software currently availableat . A. Stathopoulos, S. Öğüt, Y. Saad, J. R. Chelikowsky, andH. Kim. Parallel methods and tools for predicting materialproperties. Computing in Science and Engineering,2(4):19–32, 2000. S. Vadhiyar and J. Dongarra. A Performance OrientedMigration Framework for the Grid. Technical Report, InnovativeComputing Laboratory, University of Tennessee,Knoxville, 2002. G. Voelker, E. Anderson, T. Kimbrel, M. Feeley, J. Chase,A. Karlin, and H. Levy. Implementing CooperativePrefetching and Caching in a Globally-Managed MemorySystem. pages 33–43, Madison, Wisconsin, June 1999. P. R. Woodward, S. Anderson, D. H. Porter and A. Iyer.Cluster Computing in the SHMOD Framework on theNSF TeraGrid. Technical Report, Laboratory for ComputationalScience and Engineering, University of Minnesota,April 2004.