13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

The VTune Performance Analyzer also provides measures for a number of workload characteristics, including:

• retirement throughput of instruction execution as an indication of the degree of extractable instruction-level parallelism in the workload
• data traffic locality as an indication of the stress point of the cache and memory hierarchy
• data traffic parallelism as an indication of the degree of effectiveness of amortization of data access latency

NOTE
Improving performance in one part of the machine does not necessarily bring significant gains to overall performance. It is possible to degrade overall performance by improving performance for some particular metric.

Where appropriate, coding recommendations in this chapter include descriptions of the VTune Performance Analyzer events that provide measurable data on the performance gain achieved by following the recommendations. For more on using the VTune analyzer, refer to the application’s online help.

3.2 PROCESSOR PERSPECTIVES

Many coding recommendations for Intel Core microarchitecture work well across Pentium M, Intel Core Solo, Intel Core Duo processors and processors based on Intel NetBurst microarchitecture. However, there are situations where a recommendation may benefit one microarchitecture more than another. Some of these are:

• Instruction decode throughput is important for processors based on Intel Core microarchitecture (Pentium M, Intel Core Solo, and Intel Core Duo processors) but less important for processors based on Intel NetBurst microarchitecture.
• Generating code with a 4-1-1 template (instruction with four μops followed by two instructions with one μop each) helps the Pentium M processor.
  Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. Processors based on Intel Core microarchitecture have 4 decoders and employ micro-fusion and macro-fusion so that each of three simple decoders is not restricted to handling simple instructions consisting of one μop.
  Taking advantage of micro-fusion will increase decoder throughput across Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors. Taking advantage of macro-fusion can improve decoder throughput further on the Intel Core 2 Duo processor family.
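To make the effect of the 4-1-1 template concrete, the following is a minimal sketch of a toy decode model, not a cycle-accurate description of any processor. It assumes one complex decoder that accepts an instruction of up to four μops, followed by a configurable number of simple decoders that accept only single-μop instructions, with instructions decoded strictly in program order. The function name decode_cycles, its parameters, and the model itself are illustrative assumptions. Instruction fetch limits, instruction lengths, micro-fusion and macro-fusion are all ignored, although a fused instruction could be approximated by giving it a μop count of 1 in the input.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (an assumption, not taken from the manual): each cycle the
 * complex decoder takes the next instruction in program order, then up to
 * n_simple simple decoders each take one following instruction, but only
 * if that instruction consists of exactly one uop. The first multi-uop
 * instruction met by a simple decoder ends the cycle and waits for the
 * complex decoder in the next cycle.
 *
 * uops[i] is the uop count of instruction i (1..4 in this model).
 * Returns the number of decode cycles needed for all n instructions.
 */
size_t decode_cycles(const int *uops, size_t n, size_t n_simple)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < n) {
        /* Complex decoder: accepts any instruction of 1 to 4 uops. */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;

        /* Simple decoders: accept consecutive single-uop instructions. */
        size_t used = 0;
        while (i < n && used < n_simple && uops[i] == 1) {
            i++;
            used++;
        }

        cycles++;
    }
    return cycles;
}
```

Under this model, with two simple decoders, the sequence of μop counts 4, 1, 1, 4, 1, 1 decodes in 2 cycles, while the same instructions ordered as 1, 1, 4, 1, 1, 4 take 3 cycles, because each four-μop instruction has to wait for the complex decoder. Passing 3 for n_simple gives a rough approximation of a four-decoder front end.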
