13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

- On Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3, the data conflict condition applies to addresses having identical values in bits 21:6.

3.6.7.3 Aliasing Cases in the Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo Processors

Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors have the following aliasing case:

• Store forwarding: If a store to an address is followed by a load from the same address, the load will not proceed until the store data is available. If a store is followed by a load and their addresses differ by a multiple of 4 KBytes, the load stalls until the store operation completes.

Assembly/Compiler Coding Rule 55. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

When declaring multiple arrays that are referenced with the same index and are each a multiple of 64 KBytes (as can happen with STRUCT_OF_ARRAY data layouts), pad them to avoid declaring them contiguously. Padding can be accomplished either by intervening declarations of other variables or by artificially increasing the dimension.

User/Source Coding Rule 8. (H impact, ML generality) Consider using a special memory allocation library with address offset capability to avoid aliasing.

One way to implement a memory allocator to avoid aliasing is to allocate more than enough space and pad. For example, allocate structures that are 68 KBytes instead of 64 KBytes to avoid the 64-KByte aliasing, or have the allocator pad and return random offsets that are a multiple of 128 bytes (the size of a cache line).

User/Source Coding Rule 9. (M impact, M generality) When padding variable declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines, suggesting an offset of 128 bytes or more.

4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them. The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.

Example 3-38 shows a routine that copies 16 bytes of memory in each iteration of a loop. If the addresses of the source buffer (EAX) and the destination buffer (EDX), taken modulo 4096, differ by 16, 32, 48, 64, or 80, loads have to wait until stores have been retired before they can continue. For example, at offset 16, the load of the next iteration is 4-KByte aliased with the current iteration's store, therefore the loop must wait until the store operation completes, making the entire loop serialized. The amount of time
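
The aliasing condition and the padding idea in User/Source Coding Rule 8 can be illustrated with a minimal C sketch. It is not taken from the manual: the names `differs_by_4k_multiple`, `alias_distance_iters`, `alloc_with_offset`, and `padded_block` are hypothetical. The allocator also takes a caller-supplied slot index where the text speaks of random offsets, and it does not control the alignment of the address returned by `malloc`, so it only shows the over-allocate-and-pad mechanics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIAS_4K 4096u  /* 4-KByte aliasing granularity described in the text */
#define PAD_UNIT 128u   /* padding granularity suggested by Rule 8 */

/* Returns 1 when two distinct addresses differ by a multiple of 4 KBytes. */
static int differs_by_4k_multiple(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)a, y = (uintptr_t)b;
    return x != y && (x - y) % ALIAS_4K == 0;
}

/* For a copy loop that advances both pointers by `step` bytes per iteration:
 * returns how many iterations after a store a load reaches an address that
 * 4-KByte aliases that store. Returns 0 when the offset modulo 4096 is zero
 * or is not a multiple of `step` (simplification made for this sketch). */
static size_t alias_distance_iters(const void *src, const void *dst, size_t step)
{
    assert(step != 0);
    size_t d = (size_t)(((uintptr_t)dst - (uintptr_t)src) % ALIAS_4K);
    return (d != 0 && d % step == 0) ? d / step : 0;
}

/* Hypothetical padded allocation: `base` goes to free(), `data` is usable. */
typedef struct {
    void *base;
    void *data;
} padded_block;

/* Over-allocates by max_slots * PAD_UNIT bytes and shifts the usable region
 * by (slot % max_slots) * PAD_UNIT bytes. Size overflow is not checked. */
static padded_block alloc_with_offset(size_t size, size_t slot, size_t max_slots)
{
    padded_block b = { NULL, NULL };
    assert(max_slots != 0);
    size_t shift = (slot % max_slots) * PAD_UNIT;
    b.base = malloc(size + max_slots * PAD_UNIT);
    if (b.base != NULL)
        b.data = (unsigned char *)b.base + shift;
    return b;
}
```

With a 16-byte step, a destination that sits 16 bytes past the source modulo 4096 gives a distance of one iteration, which corresponds to the offset-16 case described above; an offset of 80 gives five iterations. A caller would release a padded block with `free(b.base)`, never `free(b.data)`.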
