Intel® 64 and IA-32 Architectures Optimization Reference Manual
MULTICORE AND HYPER-THREADING TECHNOLOGY

8.6.1 Cache Blocking Technique

Loop blocking is useful for reducing cache misses and improving memory access performance. The selection of a suitable block size is critical when applying the loop blocking technique. Loop blocking is applicable to single-threaded applications as well as to multithreaded applications running on processors with or without HT Technology. The technique transforms the memory access pattern into blocks that efficiently fit in the target cache size.

When targeting Intel processors supporting HT Technology, the loop blocking technique for a unified cache can select a block size that is no more than one half of the target cache size, if there are two logical processors sharing that cache. The upper limit of the block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Typically, some cache lines are needed to access data that are not part of the source or destination buffers used in cache blocking, so the block size can be chosen between one quarter and one half of the target cache (see Chapter 3, "General Optimization Guidelines").

Software can use the deterministic cache parameter leaf of CPUID to discover which subset of logical processors share a given cache (see Chapter 9, "Optimizing Cache Usage"). Therefore, the guideline above can be extended to allow all the logical processors serviced by a given cache to use the cache simultaneously, by placing an upper limit on the block size equal to the total size of the cache divided by the number of logical processors serviced by that cache. This technique can also be applied to single-threaded applications that will be used as part of a multitasking workload.

User/Source Coding Rule 31. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously.

8.6.2 Shared-Memory Optimization

Maintaining cache coherency between discrete processors frequently involves moving data across a bus that operates at a clock rate substantially slower than the processor frequency.

8.6.2.1 Minimize Sharing of Data between Physical Processors

When two threads are executing on two physical processors and sharing data, reading from or writing to shared data usually involves several bus transactions (including snooping, request-for-ownership changes, and sometimes fetching data across the bus). A thread accessing a large amount of shared memory is likely to have poor processor-scaling performance.
