13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

occur at large strides. If the dimensions happen to be powers of 2, aliasing conditions due to the finite number of ways of associativity (see “Capacity Limits and Aliasing in Caches” in Chapter ) will make cache evictions more likely.

Example 9-8. Using HW Prefetch to Improve Read-Once Memory Traffic

a) Un-optimized image transpose

    // dest and src represent two-dimensional arrays
    for (i = 0; i < NUMCOLS; i++) {
        // inner loop reads a single column
        for (j = 0; j < NUMROWS; j++) {
            // each read reference causes a large-stride cache miss
            dest[i*NUMROWS + j] = src[j*NUMROWS + i];
        }
    }

b)

    // tilewidth = L2SizeInBytes/2/TileHeight/Sizeof(element)
    for (i = 0; i < NUMCOLS; i += tilewidth) {
        for (j = 0; j < NUMROWS; j++) {
            // access multiple elements in the same row in the inner loop;
            // this access pattern is friendly to hw prefetch and improves hit rate
            for (k = 0; k < tilewidth; k++)
                dest[j + (i+k)*NUMROWS] = src[i + k + j*NUMROWS];
        }
    }

Example 9-8 (b) applies tiling, with the tile size and tile width selected to take advantage of hardware prefetch. With tiling, one can choose the size of two tiles to fit in the last level cache. Maximizing the width of each tile for memory read references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.

9.6.12 Single-pass versus Multi-pass Execution

An algorithm can use single- or multi-pass execution, defined as follows:

• Single-pass, or unlayered, execution passes a single data element through an entire computation pipeline.
• Multi-pass, or layered, execution performs a single stage of the pipeline on a batch of data elements before passing the batch on to the next stage.
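The two execution styles defined above can be contrasted with a minimal sketch in C. It is an illustration, not code from the manual: the two stages (stage_scale, stage_offset), the function names, and the BATCH size are all hypothetical, and the pipeline is assumed to consist of independent per-element stages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch size: 256 floats (1 KB), small enough that the
   temporary batch buffer is likely to stay cache-resident. */
#define BATCH 256

/* Two hypothetical pipeline stages; any per-element functions would do. */
static float stage_scale(float x)  { return x * 2.0f; }
static float stage_offset(float x) { return x + 1.0f; }

/* Single-pass (unlayered): each element travels through the whole
   pipeline before the next element is touched. */
void run_single_pass(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = stage_offset(stage_scale(in[i]));
}

/* Multi-pass (layered): one stage is applied to a whole batch, and the
   batch is then handed on to the next stage. */
void run_multi_pass(const float *in, float *out, size_t n)
{
    float tmp[BATCH];  /* intermediate results for the current batch */

    assert(in != NULL && out != NULL);
    for (size_t base = 0; base < n; base += BATCH) {
        size_t len = (n - base < BATCH) ? (n - base) : BATCH;

        for (size_t i = 0; i < len; i++)   /* stage 1 over the batch */
            tmp[i] = stage_scale(in[base + i]);
        for (size_t i = 0; i < len; i++)   /* stage 2 over the batch */
            out[base + i] = stage_offset(tmp[i]);
    }
}
```

Both functions compute the same result for the same input. They differ only in the order of the work: the single-pass version keeps one element live across all stages, while the multi-pass version keeps one stage live across a batch and needs a temporary buffer to carry the batch between stages.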
