What Every Programmer Should Know About Memory

More documents

Recommendations

Info

AMD’s Pacifica extension introduced the ASID to avoid TLB flushes on each entry. The number of bits for the ASID is one in the initial release of the processor extensions; this is just enough to differentiate VMM and guest OS. Intel has virtual processor IDs (VPIDs) which serve the same purpose, only there are more of them. But the VPID is fixed for each guest domain and therefore it cannot be used to mark separate processes and avoid TLB flushes at that level, too. The amount of work needed for each address space modification is one problem with virtualized OSes. There is another problem inherent in VMM-based virtualization, though: there is no way around having two layers of memory handling. But memory handling is hard (especially when taking complications like NUMA into account, see section 5). The Xen approach of using a separate VMM makes optimal (or even good) handling hard since all the complications of a memory management implementation, including “trivial” things like discovery of memory regions, must be duplicated in the VMM. The OSes have fully-fledged and optimized implementations; one really wants to avoid duplicating them. Guest Kernel Userlevel Process KVM VMM Linux Kernel Guest Kernel KVM VMM Figure 4.5: KVM Virtualization Model This is why carrying the VMM/Dom0 model to its conclusion is such an attractive alternative. Figure 4.5 shows how the KVM Linux kernel extensions try to solve the problem. There is no separate VMM running directly on the hardware and controlling all the guests; instead, a normal Linux kernel takes over this functionality. This means the complete and sophisticated memory handling functionality in the Linux kernel is used to manage the memory of the system. Guest domains run alongside the normal user-level processes in what the creators call “guest mode”. The virtualization functionality, para- or full virtualization, is controlled by the KVM VMM. This is just another userlevel process which happens to control a guest domain using the special KVM device the kernel implements. The benefit of this model over the separate VMM of the Xen model is that, even though there are still two memory handlers at work when guest OSes are used, there only needs to be one implementation, that in the Linux kernel. It is not necessary to duplicate the same functionality in another piece of code like the Xen VMM. This leads to less work, fewer bugs, and, perhaps, less friction where the two memory handlers touch since the memory handler in a Linux guest makes the same assumptions as the memory handler in the outer Linux kernel which runs on the bare hardware. Overall, programmers must be aware that, with virtualization used, the cost of cache misses (instruction, data, or TLB) is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments. Processor designers will, over time, reduce the difference more and more through technologies like EPT and NPT but it will never completely go away. 42 Version 1.0 What Every Programmer Should Know About Memory
5 NUMA Support In section 2 we saw that, on some machines, the cost of access to specific regions of physical memory differs depending on where the access originated. This type of hardware requires special care from the OS and the applications. We will start with a few details of NUMA hardware, then we will cover some of the support the Linux kernel provides for NUMA. 5.1 NUMA Hardware Non-uniform memory architectures are becoming more and more common. In the simplest form of NUMA, a processor can have local memory (see Figure 2.3) which is cheaper to access than memory local to other processors. The difference in cost for this type of NUMA system is not high, i.e., the NUMA factor is low. NUMA is also–and especially–used in big machines. We have described the problems of having many processors access the same memory. For commodity hardware all processors would share the same Northbridge (ignoring the AMD Opteron NUMA nodes for now, they have their own problems). This makes the Northbridge a severe bottleneck since all memory traffic is routed through it. Big machines can, of course, use custom hardware in place of the Northbridge but, unless the memory chips used have multiple ports–i.e. they can be used from multiple busses–there still is a bottleneck. Multiport RAM is complicated and expensive to build and support and, therefore, it is hardly ever used. The next step up in complexity is the model AMD uses where an interconnect mechanism (Hyper Transport in AMD’s case, technology they licensed from Digital) provides access for processors which are not directly connected to the RAM. The size of the structures which can be formed this way is limited unless one wants to increase the diameter (i.e., the maximum distance between any two nodes) arbitrarily. 2 1 C = 1 2 1 C = 2 4 3 5 Figure 5.1: Hypercubes 7 2 4 1 3 6 8 C = 3 An efficient topology for connecting the nodes is the hypercube, which limits the number of nodes to 2 C where C is the number of interconnect interfaces each node has. Hypercubes have the smallest diameter for all systems with 2 n CPUs and n interconnects. Figure 5.1 shows the first three hypercubes. Each hypercube has a diameter of C which is the absolute minimum. AMD’s firstgeneration Opteron processors have three hypertransport links per processor. At least one of the processors has to have a Southbridge attached to one link, meaning, currently, that a hypercube with C = 2 can be implemented directly and efficiently. The next generation will at some point have four links, at which point C = 3 hypercubes will be possible. This does not mean, though, that larger accumulations of processors cannot be supported. There are companies which have developed crossbars allowing larger sets of processors to be used (e.g., Newisys’s Horus). But these crossbars increase the NUMA factor and they stop being effective at a certain number of processors. The next step up means connecting groups of CPUs and implementing a shared memory for all of them. All such systems need specialized hardware and are by no means commodity systems. Such designs exist at several levels of complexity. A system which is still quite close to a commodity machine is IBM x445 and similar machines. They can be bought as ordinary 4U, 8-way machines with x86 and x86-64 processors. Two (at some point up to four) of these machines can then be connected to work as a single machine with shared memory. The interconnect used introduces a significant NUMA factor which the OS, as well as applications, must take into account. At the other end of the spectrum, machines like SGI’s Altix are designed specifically to be interconnected. SGI’s NUMAlink interconnect fabric is very fast and has low latency at the same time; both properties are requirements for high-performance computing (HPC), specifically when Message Passing Interfaces (MPI) are used. The drawback is, of course, that such sophistication and specialization is very expensive. They make a reasonably low NUMA factor possible but with the number of CPUs these machines can have (several thousands) and the limited capacity of the interconnects, the NUMA factor is actually dynamic and can reach unacceptable levels depending on the workload. More commonly used are solutions where many commodity machines are connected using high-speed networking to form a cluster. These are no NUMA machines, though; they do not implement a shared address space and therefore do not fall into any category which is discussed here. 5.2 OS Support for NUMA To support NUMA machines, the OS has to take the distributed nature of the memory into account. For instance, if a process is run on a given processor, the physical RAM assigned to the process’s address space should ideally come from local memory. Otherwise each instruction has to access remote memory for code and data. There are special cases to be taken into account which are only present in NUMA machines. The text segment of DSOs is normally present exactly once in a machine’s Ulrich Drepper Version 1.0 43
Page 1 and 2: 1 Introduction What Every Programme
Page 3 and 4: 2 Commodity Hardware Today It is im
Page 5 and 6: The following two sections discuss
Page 7 and 8: 30 impulses on the address lines sy
Page 9 and 10: ple, DDR2 works the same although i
Page 11 and 12: keters came up with for DDR2 are si
Page 13 and 14: ford to pipe all the data they need
Page 15 and 16: L3 Cache Main Memory Bus L2 Cache L
Page 17 and 18: a large part (probably even the maj
Page 19 and 20: L2 Associativity Cache Direct 2 4 8
Page 21 and 22: Cycles/List Element Cycles/List Ele
Page 23 and 24: Cycles/List Element Cycles/List Ele
Page 25 and 26: (in one test we will see later up t
Page 27 and 28: From this description of the state
Page 29 and 30: Speed-Up 4.5 4 3.5 3 2.5 2 1.5 1 0.
Page 31 and 32: leaving much of the cache unused. I
Page 33 and 34: is sequential prefetching can predi
Page 35 and 36: and the memory controller can reque
Page 37 and 38: 4 Virtual Memory The virtual memory
Page 39 and 40: So, instead of just caching the dir
Page 41: $ eu-readelf -l /bin/ls Program Hea
Page 45 and 46: type level shared_cpu_map index0 Da
Page 47 and 48: Slowdown Vs Local Memory 15% 10% 5%
Page 49 and 50: On the read side, processors, until
Page 51 and 52: Original Transposed Sub-Matrix Vect
Page 53 and 54: The function stores a pointer point
Page 55 and 56: 10 9 8 7 6 5 4 3 2 1 0 z 16 32 48 6
Page 57 and 58: places. If the L1i content can be r
Page 59 and 60: • L2 caches and higher are often
Page 61 and 62: fetcher that the processor should n
Page 63 and 64: ld8.c.clr r6 = [r8];; add r5 = r6,
Page 65 and 66: memory without involving the CPU. T
Page 67 and 68: it is possible to guarantee that no
Page 69 and 70: Here we have to use a special load
Page 71 and 72: #define _GNU_SOURCE #include int s
Page 73 and 74: on Linux is wholly inadequate for t
Page 75 and 76: If both the MPOL_MF_STRICT and MPOL
Page 77 and 78: dex returned by the NUMA_memnode_se
Page 79 and 80: Cache Miss Ratio 22% 20% 18% 16% 14
Page 81 and 82: ==19645== I refs: 152,653,497 ==196
Page 83 and 84: src/readelf -a -w src/readelf 106,0
Page 85 and 86: supposed to be read from or written
Page 87 and 88: here is that the linker will put th
Page 89 and 90: still fits completely into one sing
Page 91 and 92: element of the LIFO are determined
Page 93 and 94:
new list member. This is fine since
Page 95 and 96:
use siglongjmp to jump to an outer
Page 97 and 98:
A Examples and Benchmark Programs A
Page 99 and 100:
3. allocate one word in the predict
Page 101 and 102:
} void *p; posix_memalign(&p, 64, (
Page 103 and 104:
even possible to differentiate DSOs
Page 105 and 106:
C Memory Types Though it is not nec
Page 107 and 108:
= 1 0 0 1 0 0 1 Performing the te
Page 109 and 110:
size_t cpusetsize, cpu_set_t *cpuse
Page 111 and 112:
E Index ABI, . . . . . . . . . . .
Page 113 and 114:
F Bibliography [1] Performance Guid
show all

What Every Programmer Should Know About Memory

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?