13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

9.7.2.3 Conclusions from Video Encoder and Decoder Implementation

These two examples indicate that by using an appropriate combination of non-temporal prefetches and non-temporal stores, an application can be designed to lessen the overhead of memory transactions by preventing second-level cache pollution, keeping useful data in the second-level cache, and reducing costly write-back transactions. Even if an application does not gain significant performance from having data ready from prefetches, it can improve from more efficient use of the second-level cache and memory. Such a design reduces the encoder's demand for critical resources such as the memory bus. This makes the system more balanced, resulting in higher performance.

9.7.2.4 Optimizing Memory Copy Routines

Creating memory copy routines for large amounts of data is a common task in software optimization. Example 9-9 presents a basic algorithm for a simple memory copy.

Example 9-9. Basic Algorithm of a Simple Memory Copy

    #define N 512000
    double a[N], b[N];
    for (i = 0; i < N; i++) {
        b[i] = a[i];
    }

This task can be optimized using various coding techniques. One technique uses software prefetch and streaming store instructions. It is discussed in the following paragraphs, and a code example is shown in Example 9-10.

The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:

• Alignment of data
• Proper layout of pages in memory
• Cache size
• Interaction of the translation lookaside buffer (TLB) with memory accesses
• Combining prefetch and streaming-store instructions

The guidelines discussed in this chapter come into play in this simple example. TLB priming is required for the Pentium 4 processor, just as it is for the Pentium III processor, since software prefetch instructions will not initiate page table walks on either processor.
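The considerations listed above can be illustrated with a minimal sketch in C using SSE/SSE2 intrinsics. This is not the manual's Example 9-10; it is an illustration under stated assumptions. The helper name stream_copy, the 64-byte cache line, the 4 KB page size, and the prefetch distance are all assumptions introduced here, and the page-priming step counts pages from the start of the source buffer, so it lines up with real pages only if that buffer is page-aligned.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE: _mm_prefetch, _mm_sfence */

/* Assumed platform parameters; none of these come from the text above. */
#define DOUBLES_PER_LINE 8u     /* 64-byte cache line */
#define DOUBLES_PER_PAGE 512u   /* 4 KB page */
#define PREFETCH_AHEAD   64u    /* illustrative prefetch distance, in doubles */

/* Hypothetical helper: copy n doubles from src to dst.
 * Both pointers must be 16-byte aligned and the buffers must not overlap. */
static void stream_copy(double *dst, const double *src, size_t n)
{
    volatile double sink = 0.0;  /* keeps the priming loads from being removed */
    size_t full = n - (n % DOUBLES_PER_LINE);
    size_t i;

    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);

    for (i = 0; i < full; i += DOUBLES_PER_LINE) {
        /* TLB priming: on entering a page, read one element of the next
         * page so its translation is loaded before prefetches reach it. */
        if (i % DOUBLES_PER_PAGE == 0 && i + DOUBLES_PER_PAGE < n)
            sink = src[i + DOUBLES_PER_PAGE];

        /* Non-temporal prefetch of a source line further ahead. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);

        /* Copy one 64-byte line using streaming (non-temporal) stores. */
        _mm_stream_pd(dst + i,     _mm_load_pd(src + i));
        _mm_stream_pd(dst + i + 2, _mm_load_pd(src + i + 2));
        _mm_stream_pd(dst + i + 4, _mm_load_pd(src + i + 4));
        _mm_stream_pd(dst + i + 6, _mm_load_pd(src + i + 6));
    }

    /* Scalar tail for any elements that do not fill a whole line. */
    for (; i < n; i++)
        dst[i] = src[i];

    _mm_sfence();  /* make the streaming stores globally visible */
    (void)sink;
}
```

With the arrays of Example 9-9 declared with 16-byte alignment, the copy loop would become a single call such as stream_copy(b, a, N). The prefetch distance and the per-page priming are tuning choices that depend on the target processor, so they should be measured rather than taken from this sketch.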
