Intel® 64 and IA-32 Architectures Optimization Reference Manual
OPTIMIZING CACHE USAGE

• Take advantage of the hardware prefetcher's ability to prefetch data that are accessed in linear patterns, in either a forward or backward direction.
• Take advantage of the hardware prefetcher's ability to prefetch data that are accessed in a regular pattern with access strides that are substantially smaller than half of the trigger distance of the hardware prefetch (see Table 2-6).
• Use a current-generation compiler, such as the Intel C++ Compiler, that supports C++ language-level features for Streaming SIMD Extensions. Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization. Examples of Intel compiler intrinsics include: _mm_prefetch, _mm_stream, _mm_load, and _mm_sfence. For details, refer to the Intel C++ Compiler User's Guide documentation.
• Facilitate compiler optimization by:
— Minimizing use of global variables and pointers.
— Minimizing use of complex control flow.
— Using the const modifier; avoiding the register modifier.
— Choosing data types carefully (see below) and avoiding type casting.
• Use cache blocking techniques (for example, strip mining) as follows:
— Improve the cache hit rate by using cache blocking techniques such as strip mining (one-dimensional arrays) or loop blocking (two-dimensional arrays); a tiling sketch follows this list.
— Explore using the hardware prefetching mechanism if your data access pattern has sufficient regularity to allow alternate sequencing of data accesses (for example, tiling) for improved spatial locality. Otherwise use PREFETCHNTA.
• Balance single-pass versus multi-pass execution:
— Single-pass, or unlayered, execution passes a single data element through an entire computation pipeline.
— Multi-pass, or layered, execution performs a single stage of the pipeline on a batch of data elements before passing the entire batch on to the next stage.
— If your algorithm is single-pass, use PREFETCHNTA. If your algorithm is multi-pass, use PREFETCHT0.
• Resolve memory bank conflict issues. Minimize memory bank conflicts by applying array grouping to group contiguously used data together or by allocating data within 4-KByte memory pages (see the array-grouping sketch below).
• Resolve cache management issues. Minimize the disturbance of temporal data held within the processor's caches by using streaming store instructions (see the streaming-store sketch below).
• Optimize the software prefetch scheduling distance (see the prefetch sketch after this list):
— Far enough ahead to allow interim computations to overlap memory access time.
— Near enough that prefetched data is not replaced from the data cache.
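The loop-blocking bullet above can be made concrete with a short sketch. The C fragment below transposes an n-by-n matrix in BLOCK-by-BLOCK tiles so that the working set of each tile stays resident in the cache while it is processed. The function name, the BLOCK value, and the square row-major layout are illustrative assumptions, not taken from the manual.

#include <stddef.h>

#define BLOCK 64  /* tile edge; assumed small enough that one src tile
                     and one dst tile fit in the cache together */

void transpose_blocked(float *dst, const float *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            /* sweep one tile; the && bounds handle ragged edges
               when n is not a multiple of BLOCK */
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}

Without blocking, each store to dst walks a column and touches a new cache line per element; with blocking, each line loaded for a tile is reused before it is evicted.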
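As a sketch of the array-grouping advice, the struct below interleaves two arrays that are always consumed together, so their accesses arrive as one linear address stream instead of two separate streams that may conflict in the same cache banks or alias across 4-KByte pages. All names here are hypothetical.

#include <stddef.h>

typedef struct {
    float a;   /* was: a separate array A[] */
    float b;   /* was: a separate array B[] */
} pair_t;

float dot_grouped(const pair_t *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].a * p[i].b;  /* one stream; no conflicts between A and B */
    return s;
}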
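For the streaming-store bullet, a minimal sketch (assuming dst is 16-byte aligned and n is a multiple of 4) fills an output buffer with _mm_stream_ps so the writes bypass the cache hierarchy and leave temporal data in the caches undisturbed; _mm_sfence then orders the non-temporal stores before the buffer is consumed elsewhere.

#include <stddef.h>
#include <xmmintrin.h>

void fill_streaming(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);   /* broadcast value to 4 lanes */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(&dst[i], v);   /* non-temporal store: no cache fill */
    _mm_sfence();                    /* make streamed stores globally visible */
}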
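Finally, a sketch of software prefetch scheduling distance. PSD below is an assumed tuning constant: far enough ahead that the prefetch overlaps the computation on earlier elements, near enough that the line is still in the data cache when the loop reaches it. Because this kernel touches each element exactly once (single-pass), it uses _MM_HINT_NTA (PREFETCHNTA); a multi-pass kernel would use _MM_HINT_T0 (PREFETCHT0) instead.

#include <stddef.h>
#include <xmmintrin.h>

#define PSD 64  /* prefetch scheduling distance in elements; tune per platform */

void scale_single_pass(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PSD < n)
            _mm_prefetch((const char *)&src[i + PSD], _MM_HINT_NTA);
        dst[i] = src[i] * k;  /* interim computation overlaps the prefetch */
    }
}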
