13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

resource stalls. In addition, this mixing reduces the pressure on the memory access resources and in turn reduces the possibility of the prefetch retiring without fetching data.

Figure 9-6 illustrates distributing PREFETCH instructions. A simple and useful heuristic of prefetch spreading for a Pentium 4 processor is to insert a PREFETCH instruction every 20 to 25 clocks. Rearranging PREFETCH instructions could yield a noticeable speedup for code that stresses the cache resource.

Original loop (PREFETCH instructions grouped at the top):

    top_loop:
        prefetchnta [ebx+128]
        prefetchnta [ebx+1128]
        prefetchnta [ebx+2128]
        prefetchnta [ebx+3128]
        . . . .
        . . . .
        prefetchnta [ebx+17128]
        prefetchnta [ebx+18128]
        prefetchnta [ebx+19128]
        prefetchnta [ebx+20128]
        movps xmm1, [ebx]
        addps xmm2, [ebx+3000]
        mulps xmm3, [ebx+4000]
        addps xmm1, [ebx+1000]
        addps xmm2, [ebx+3016]
        mulps xmm1, [ebx+2000]
        mulps xmm1, xmm2
        . . . . . . . .
        . . . . . .
        . . . . .
        add ebx, 128
        cmp ebx, ecx
        jl top_loop

Spread prefetches:

    top_loop:
        prefetchnta [ebx+128]
        movps xmm1, [ebx]
        addps xmm2, [ebx+3000]
        mulps xmm3, [ebx+4000]
        prefetchnta [ebx+1128]
        addps xmm1, [ebx+1000]
        addps xmm2, [ebx+3016]
        prefetchnta [ebx+2128]
        mulps xmm1, [ebx+2000]
        mulps xmm1, xmm2
        prefetchnta [ebx+3128]
        . . . . . . .
        . . .
        prefetchnta [ebx+18128]
        . . . . . .
        prefetchnta [ebx+19128]
        . . . . . .
        . . . .
        prefetchnta [ebx+20128]
        add ebx, 128
        cmp ebx, ecx
        jl top_loop

Figure 9-6. Spread Prefetch Instructions

NOTE
To avoid instruction execution stalls due to the over-utilization of the resource, PREFETCH instructions must be interspersed with computational instructions.

9.6.10 Software Prefetch and Cache Blocking Techniques

Cache blocking techniques (such as strip-mining) are used to improve temporal locality and the cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, the loop blocking technique (similar to strip-mining but in two dimensions) can be applied for better memory performance.
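The two cache blocking techniques named above can be illustrated with a minimal C sketch. It is not taken from the manual: the function names, the STRIP_ELEMS value, and the two-pass scale-and-bias workload are hypothetical choices made only for illustration, and a real strip or block size would have to be tuned to the cache level being targeted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strip size in elements, chosen only for illustration.
   In practice it would be tuned so that one strip fits in the target cache. */
#define STRIP_ELEMS 1024

/* Without strip-mining: each pass walks the whole array, so for a large
   array the first elements may be evicted before the second pass reuses them. */
void two_pass_plain(float *data, size_t n, float scale, float bias)
{
    for (size_t i = 0; i < n; i++) data[i] *= scale;
    for (size_t i = 0; i < n; i++) data[i] += bias;
}

/* Strip-mining (one dimension): both passes run over one strip before
   moving to the next, so the second pass reuses data touched moments ago. */
void two_pass_strip_mined(float *data, size_t n, float scale, float bias)
{
    for (size_t start = 0; start < n; start += STRIP_ELEMS) {
        size_t end = (n - start < STRIP_ELEMS) ? n : start + STRIP_ELEMS;
        for (size_t i = start; i < end; i++) data[i] *= scale;
        for (size_t i = start; i < end; i++) data[i] += bias;
    }
}

/* Loop blocking (two dimensions): a[i][j] += b[j][i] for n x n row-major
   matrices. b is read column-wise, so the loops are tiled into
   block x block pieces to keep the touched rows of b in cache. */
void add_transposed_blocked(float *a, const float *b, size_t n, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = (n - ii < block) ? n : ii + block;
        for (size_t jj = 0; jj < n; jj += block) {
            size_t j_end = (n - jj < block) ? n : jj + block;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    a[i * n + j] += b[j * n + i];
        }
    }
}
```

The strip-mined and plain versions perform the same per-element arithmetic and differ only in traversal order, which is the point of the technique: the result is unchanged while the reuse distance shrinks. Both blocked loops also handle sizes that are not a multiple of the strip or block size by clamping the final strip or tile.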
