What Every Programmer Should Know About Memory


and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. In section 2.2 we will cover more details of RAM access patterns.

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

[Figure 2.2: Northbridge with External Controllers]

The advantage of this architecture is that more than one memory bus exists and therefore total available bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel). (See footnote 4.)

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture was made popular by SMP systems based on AMD’s Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.
With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

Footnote 4: For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as “memory RAID”, which is useful in combination with hotplug memory.

[Figure 2.3: Integrated Memory Controller]

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA, Non-Uniform Memory Architecture, for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. Accessing memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4, two interconnects have to be crossed.

Each such communication has an associated cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM’s x445 and SGI’s Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors.
The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.
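The interconnect counts described for Figure 2.3 can be reproduced with a small model. The following is only a sketch: the names `interconnect` and `numa_hops` are made up for this illustration, and the link table is an assumption derived from the statement that every CPU has two immediately adjacent CPUs and one CPU which is two interconnects away. It counts interconnects only and says nothing about actual latencies.

```c
#include <assert.h>

#define NCPUS 4

/* Assumed interconnect layout of the four-CPU machine in Figure 2.3.
 * Indices 0..3 stand for CPU1..CPU4; a 1 means a direct interconnect. */
static const int interconnect[NCPUS][NCPUS] = {
    /* CPU1 */ {0, 1, 1, 0},
    /* CPU2 */ {1, 0, 0, 1},
    /* CPU3 */ {1, 0, 0, 1},
    /* CPU4 */ {0, 1, 1, 0},
};

/* Return the number of interconnects crossed when CPU `from` accesses
 * memory attached to CPU `to`. 0 means local memory, -1 means that the
 * target cannot be reached at all. Uses a breadth-first search. */
int numa_hops(int from, int to)
{
    assert(from >= 0 && from < NCPUS);
    assert(to >= 0 && to < NCPUS);

    int dist[NCPUS];
    int queue[NCPUS];   /* each CPU is enqueued at most once */
    int head = 0, tail = 0;

    for (int i = 0; i < NCPUS; i++)
        dist[i] = -1;

    dist[from] = 0;
    queue[tail++] = from;

    while (head < tail) {
        int cur = queue[head++];
        for (int next = 0; next < NCPUS; next++) {
            if (interconnect[cur][next] && dist[next] < 0) {
                dist[next] = dist[cur] + 1;
                queue[tail++] = next;
            }
        }
    }
    return dist[to];
}
```

With this table, `numa_hops(0, 1)` corresponds to CPU1 accessing memory attached to CPU2 (one interconnect) and `numa_hops(0, 3)` to CPU1 accessing memory attached to CPU4 (two interconnects). A machine with more levels would only need a larger table.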
The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to section 2.2.5.

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the others. The older types are today really only interesting to historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scratch the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question of why there are different types of RAM in the same machine. More specifically, why are there both static RAM (SRAM; see footnote 5) and dynamic RAM (DRAM)? The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasingly so. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a “logic level” and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

[Figure 2.4: 6-T Static RAM]

Figure 2.4 shows the structure of a 6-transistor SRAM cell.
The core of this cell is formed by the four transistors M1 to M4, which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

Footnote 5: In other contexts SRAM might mean “synchronous RAM”.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and its complement BL-bar. If the cell state must be overwritten the BL and BL-bar lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [20] for a more detailed description of the way the cell works. For the following discussion it is important to note that

• one cell requires six transistors. There are variants with four transistors but they have disadvantages.
• maintaining the state of the cell requires constant power.
• the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
• the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

[Figure 2.5: 1-T Dynamic RAM]

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state.
To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading

Ulrich Drepper, Version 1.0
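The read and write procedures described in sections 2.1.1 and 2.1.2 can be sketched at the same “logic level” the text uses. This is a deliberately simplified model, not a circuit simulation: all type and function names are invented for this illustration, raising WL or AL is represented by the function call itself, and analog behaviour (signal timing, charge levels) is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 6-T SRAM cell: M1..M4 form two cross-coupled inverters, modeled here
 * by their two complementary outputs. */
struct sram_cell {
    bool q;
    bool q_bar;
};

/* Write: BL and BL-bar are set to the desired values, then WL is raised.
 * The outside drivers are stronger than the inverters, so the old state
 * is overwritten. */
void sram_write(struct sram_cell *cell, bool bl)
{
    assert(cell != NULL);
    cell->q = bl;
    cell->q_bar = !bl;
}

/* Read: raising WL makes the state available on BL (the return value)
 * and on BL-bar (stored through bl_bar if it is not NULL). */
bool sram_read(const struct sram_cell *cell, bool *bl_bar)
{
    assert(cell != NULL);
    if (bl_bar != NULL)
        *bl_bar = cell->q_bar;
    return cell->q;
}

/* A state is stable when each inverter output is the complement of the
 * other; exactly two such states exist, representing 0 and 1. */
bool sram_is_stable(const struct sram_cell *cell)
{
    assert(cell != NULL);
    return cell->q == !cell->q_bar;
}

/* 1-T DRAM cell: the state is the charge in capacitor C, guarded by
 * transistor M. */
struct dram_cell {
    bool charged;
};

/* Write: DL is set appropriately, then AL is raised long enough to
 * charge or drain the capacitor. */
void dram_write(struct dram_cell *cell, bool dl)
{
    assert(cell != NULL);
    cell->charged = dl;
}

/* Read: AL is raised; a current flows on DL only if C holds a charge. */
bool dram_read(const struct dram_cell *cell)
{
    assert(cell != NULL);
    return cell->charged;
}
```

The SRAM model needs two complementary values per bit while the DRAM model needs a single one, which loosely mirrors the difference in complexity between the two cell designs.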