Intel® 64 and IA-32 Architectures Optimization Reference Manual
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.1.4 Intel® Advanced Memory Access

The Intel Core microarchitecture contains an instruction cache and a first-level data cache in each core. The two cores share a 2 or 4-MByte L2 cache. All caches are writeback and non-inclusive. Each core contains:

• L1 data cache, known as the data cache unit (DCU) — The DCU can handle multiple outstanding cache misses and continue to service incoming stores and loads. It supports maintaining cache coherency. The DCU has the following specifications:
  — 32-KBytes size
  — 8-way set associative
  — 64-bytes line size
• Data translation lookaside buffer (DTLB) — The DTLB in Intel Core microarchitecture implements two levels of hierarchy. Each level of the DTLB has multiple entries and can support either 4-KByte pages or large pages. The entries of the inner level (DTLB0) are used for loads. The entries in the outer level (DTLB1) support store operations and loads that miss DTLB0. All entries are 4-way associative. Here is a list of entries in each DTLB:
  — DTLB1 for large pages: 32 entries
  — DTLB1 for 4-KByte pages: 256 entries
  — DTLB0 for large pages: 16 entries
  — DTLB0 for 4-KByte pages: 16 entries

A DTLB0 miss and DTLB1 hit causes a penalty of 2 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases.
The delays associated with a miss to the DTLB1 and PMH are largely non-blocking due to the design of Intel Smart Memory Access.

• Page miss handler (PMH)
• A memory ordering buffer (MOB) — Which:
  — enables loads and stores to issue speculatively and out of order
  — ensures retired loads and stores have the correct data upon retirement
  — ensures loads and stores follow memory ordering rules of the Intel 64 and IA-32 architectures.

The memory cluster of the Intel Core microarchitecture uses the following to speed up memory operations:
• 128-bit load and store operations
• data prefetching to L1 caches
• data prefetch logic for prefetching to the L2 cache
• store forwarding
• memory disambiguation