13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

Software can proactively control its data access pattern to favor smaller access strides (e.g., a stride that is less than half of the trigger threshold distance) over larger access strides (a stride that is greater than the trigger threshold distance). This can achieve the additional benefit of improved temporal locality and significantly fewer cache misses in the last level cache.

Software optimization of a data access pattern should first emphasize tuning for hardware prefetch, favoring a greater proportion of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.

2.2.4.5 Loads and Stores

The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:

• speculative execution of loads
• reordering of loads with respect to loads and stores
• multiple outstanding misses
• buffering of writes
• forwarding of data from stores to dependent loads

Performance may be enhanced by not exceeding the memory issue bandwidth and buffer resources provided by the processor. Up to one load and one store may be issued for each cycle from a memory port reservation station. For a memory operation to be dispatched to a reservation station, there must be a buffer entry available for it. There are 48 load buffers and 24 store buffers (see footnote 3). These buffers hold the µop and address information until the operation is completed, retired, and deallocated.

The Pentium 4 processor is designed to enable the execution of memory operations out of order with respect to other instructions and with respect to each other. Loads can be carried out speculatively, that is, before all preceding branches are resolved. However, speculative loads cannot cause page faults.

Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready. Writes to memory are always carried out in program order to maintain program correctness.

A cache miss for a load does not prevent other loads from issuing and completing. The Pentium 4 processor supports up to four (or eight for Pentium 4 processors with a CPUID signature corresponding to family 15, model 3) outstanding load misses that can be serviced either by on-chip caches or by memory.

3. Pentium 4 processors with CPUID model encoding equal to 3 have more than 24 store buffers.
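
The advice at the top of this page, to favor smaller access strides over larger ones, can be illustrated with a minimal C sketch. It is not taken from the manual: the function names and the flat row-major layout are assumptions made for illustration, and the trigger threshold distance itself is not given in this excerpt, so no specific value is assumed. Both functions compute the same sum over the same array; only the order of accesses, and therefore the stride between consecutive loads, differs.

```c
#include <stddef.h>

/* Hypothetical helper: sums a rows x cols matrix stored row-major in a
   flat array. The inner loop walks adjacent elements, so consecutive
   loads are sizeof(int) bytes apart (a small stride). */
long sum_small_stride(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

/* Hypothetical helper: same data, same result, but the loops are
   interchanged. The inner loop now jumps a whole row each iteration,
   so consecutive loads are cols * sizeof(int) bytes apart. */
long sum_large_stride(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

When the row length is large, the second traversal is the kind of pattern the text suggests restructuring (for example by loop interchange, as in the first function) before resorting to software prefetch instructions. Whether a given stride actually exceeds the hardware prefetcher's threshold depends on the processor and has to be measured.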
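
The text distinguishes Pentium 4 processors by CPUID signature (family 15, model 3). As a hedged sketch of how software might apply that distinction, the helpers below decode the display family and model from the 32-bit signature that CPUID leaf 1 returns in EAX, following the standard extended family/model encoding. The function names are hypothetical, and the last helper only restates the four-versus-eight figure from the paragraph above; it returns 0 for any processor outside family 15, because this excerpt says nothing about them.

```c
#include <stdint.h>

/* Display family: base family in bits 11:8; the extended family
   (bits 27:20) is added only when the base family is 0xF. */
unsigned cpuid_display_family(uint32_t sig)
{
    unsigned base = (sig >> 8) & 0xFu;
    unsigned ext = (sig >> 20) & 0xFFu;
    return base == 0xFu ? base + ext : base;
}

/* Display model: base model in bits 7:4; the extended model
   (bits 19:16) supplies the high nibble when the base family
   is 0x6 or 0xF. */
unsigned cpuid_display_model(uint32_t sig)
{
    unsigned base_family = (sig >> 8) & 0xFu;
    unsigned base_model = (sig >> 4) & 0xFu;
    unsigned ext_model = (sig >> 16) & 0xFu;
    if (base_family == 0x6u || base_family == 0xFu)
        return (ext_model << 4) + base_model;
    return base_model;
}

/* Hypothetical helper restating the text: family 15, model 3 supports
   up to eight outstanding load misses, other family 15 parts up to four.
   Returns 0 (unknown) for signatures outside family 15. */
unsigned p4_max_outstanding_load_misses(uint32_t sig)
{
    if (cpuid_display_family(sig) != 15u)
        return 0u;
    return cpuid_display_model(sig) == 3u ? 8u : 4u;
}
```

On GCC or Clang the signature can be read with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>` and the resulting `eax` passed to these helpers. Keeping the decoding separate from the CPUID call lets the logic be checked against known signature values on any machine.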
