13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

Table 9-2. Relative Performance of Memory Copy Routines (Contd.)

Processor, CPUID Signature and FSB Speed | Byte Sequential | DWORD Sequential | SW prefetch + 8 byte streaming store | 4KB-Block HW prefetch + 16 byte streaming stores
Pentium D processor, 0xF4n, 800 | 3.4X | 3.3X | 4.9X | 5.7X

The baseline for performance comparison is the throughput (bytes/sec) of an 8-MByte region memory copy on a first-generation Pentium M processor (CPUID signature 0x69n) with a 400-MHz system bus, using a byte-sequential technique similar to that shown in Example 9-9. The table compares the degree of improvement relative to this baseline for some recent processors and platforms with higher system bus speeds, using different coding techniques.

The second coding technique moves data at 4-Byte granularity using the REP string instruction. The third column compares the performance of the coding technique listed in Example 9-10. The fourth column compares the throughput of fetching 4 KBytes of data at a time (using hardware prefetch to aggregate bus read transactions) and writing to memory via 16-Byte streaming stores.

Increases in bus speed are the primary contributor to throughput improvements. The technique shown in Example 9-11 will likely take advantage of the faster bus speed in the platform more efficiently. Additionally, increasing the block size to multiples of 4 KBytes while keeping the total working set within the second-level cache can improve the throughput slightly.

The relative performance figures shown in Table 9-2 are representative of clean microarchitectural conditions within a processor (e.g. looping a simple sequence of code many times).
The net benefit of integrating a specific memory copy routine into an application (full-featured applications tend to create many complicated microarchitectural conditions) will vary for each application.

9.7.3 Deterministic Cache Parameters

If CPUID supports the deterministic parameter leaf, software can use the leaf to query each level of the cache hierarchy. Each cache level is enumerated by specifying an index value (starting from 0) in the ECX register (see “CPUID-CPU Identification” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).

The list of parameters is shown in Table 9-3.
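The enumeration described above can be sketched in C as follows. This is a minimal illustration, not code from this manual. It assumes a GCC or Clang toolchain that provides `<cpuid.h>` and `__get_cpuid_count()`, and it uses the register field layout documented for CPUID leaf 4 in the Software Developer's Manual (cache type in EAX[4:0], level in EAX[7:5], line size, partitions and ways in EBX, number of sets in ECX, each stored minus one) rather than anything stated in this excerpt. The names `cache_params_t`, `decode_cache_leaf` and `print_cache_hierarchy` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header: __get_cpuid_count() */
#endif

/* Hypothetical container for one decoded cache level (names are ours). */
typedef struct {
    unsigned type;        /* 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;       /* 1 = first level, 2 = second level, ... */
    unsigned line_size;   /* coherency line size in bytes */
    unsigned partitions;  /* physical line partitions */
    unsigned ways;        /* ways of associativity */
    uint64_t sets;        /* number of sets */
    uint64_t size_bytes;  /* ways * partitions * line_size * sets */
} cache_params_t;

/* Decode the raw EAX/EBX/ECX values returned by CPUID leaf 4.
   The hardware reports most fields as "value minus one". */
cache_params_t decode_cache_leaf(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    cache_params_t p;
    p.type       = eax & 0x1F;
    p.level      = (eax >> 5) & 0x7;
    p.line_size  = (ebx & 0xFFF) + 1;
    p.partitions = ((ebx >> 12) & 0x3FF) + 1;
    p.ways       = ((ebx >> 22) & 0x3FF) + 1;
    p.sets       = (uint64_t)ecx + 1;
    p.size_bytes = (uint64_t)p.ways * p.partitions * p.line_size * p.sets;
    return p;
}

/* Walk index values 0, 1, 2, ... in ECX until the leaf reports cache type 0.
   Returns the number of cache levels printed, or 0 if the leaf is
   unavailable (including non-x86 builds). */
int print_cache_hierarchy(void)
{
    int count = 0;
#if defined(__x86_64__) || defined(__i386__)
    for (unsigned index = 0; index < 32; ++index) {   /* 32 is an arbitrary safety cap */
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
            break;                                    /* leaf 4 not supported */
        cache_params_t p = decode_cache_leaf(eax, ebx, ecx);
        if (p.type == 0)
            break;                                    /* no more cache levels */
        printf("index %u: L%u type %u, %u-way, %u-byte lines, %llu sets, %llu bytes\n",
               index, p.level, p.type, p.ways, p.line_size,
               (unsigned long long)p.sets, (unsigned long long)p.size_bytes);
        ++count;
    }
#endif
    return count;
}
```

Calling `print_cache_hierarchy()` from `main` lists whatever the processor reports. Keeping the decoding in a separate function makes it possible to check it without hardware: for example, the illustrative register values EAX = 0x121, EBX = 0x01C0003F, ECX = 0x3F decode to a first-level data cache with 8 ways, 64-byte lines and 64 sets, or 32768 bytes.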
