Intel® 64 and IA-32 Architectures Optimization Reference Manual

B.2.1 Trace Cache Events

The trace cache is not directly comparable to an instruction cache; the two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data. As with most microarchitectural elements, trace cache performance is only an issue if something else is not a bigger bottleneck. If an application is bus bandwidth bound, the rate at which the front end delivers μops to the core may be irrelevant. When front-end bandwidth is an issue, the trace cache, in deliver mode, can issue μops to the core faster than either the decoder (build mode) or the microcode store (the MS ROM). Thus, the percent of time in trace cache deliver mode, or similarly, the percentage of all bogus and non-bogus μops that come from the trace cache, can be a useful metric for determining front-end performance.

The metric that is most analogous to an instruction cache miss is a trace cache miss. An unsuccessful lookup of the trace cache (colloquially, a miss) is not interesting, per se, if we are in build mode and do not find a trace available; we simply keep building traces. The only "penalty" in that case is that front-end bandwidth remains lower. The trace cache miss metric that is currently used is therefore not just any TC miss, but one that is incurred while the machine is already in deliver mode (for example, when a 15-20 cycle penalty is paid). Again, care must be exercised: a small average number of TC misses per instruction does not indicate good front-end performance if the percentage of time in deliver mode is also low. (A short calculation sketch showing how these two metrics are derived from raw counts follows Section B.2.2.)

B.2.2 Bus and Memory Metrics

In order to correctly interpret the observed counts of performance metrics related to bus events, it is helpful to understand transaction sizes, when entries are allocated in different queues, and how sectoring and prefetching affect the counts.

Figure B-1 is a simplified block diagram of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ. A two-way SMP configuration is illustrated. 1st level cache misses and writebacks (also called core references) result in references to the 2nd level cache. The Bus Sequence Queue (BSQ) holds requests from the processor core or prefetcher that are to be serviced on the front side bus (FSB), or in the local XAPIC. If a 3rd level cache is present on-die, the BSQ also holds writeback requests (dirty, evicted data) from the 2nd level cache. The FSB's IOQ holds requests that have gone out onto the front side bus.
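The two metrics discussed in Section B.2.1, the percentage of time in trace cache deliver mode and the number of TC misses per instruction, are simple ratios of raw event counts. The C sketch below shows one way that arithmetic might be organized once the counts have been collected with a PMU counting tool; the structure fields and the example values are illustrative placeholders only, not event mnemonics or a collection interface defined by this manual.

#include <stdio.h>

/* Raw event counts gathered elsewhere (for example, with a PMU counting
 * tool).  Field names are illustrative placeholders, not event mnemonics
 * defined by this manual. */
struct frontend_counts {
    unsigned long long cycles_total;          /* unhalted clock cycles                      */
    unsigned long long cycles_deliver_mode;   /* cycles the trace cache was in deliver mode */
    unsigned long long tc_misses_deliver;     /* TC misses incurred while in deliver mode   */
    unsigned long long instructions_retired;  /* non-bogus instructions retired             */
};

static void report_frontend_metrics(const struct frontend_counts *c)
{
    /* Percent of time the core was fed from the trace cache in deliver mode. */
    double pct_deliver = 100.0 * (double)c->cycles_deliver_mode
                               / (double)c->cycles_total;

    /* Trace cache misses per instruction; meaningful only alongside the
     * deliver-mode percentage, as cautioned in Section B.2.1. */
    double miss_per_inst = (double)c->tc_misses_deliver
                         / (double)c->instructions_retired;

    printf("TC deliver mode:       %6.2f%%\n", pct_deliver);
    printf("TC misses/instruction: %.6f\n", miss_per_inst);
}

int main(void)
{
    /* Placeholder counts purely for illustration. */
    struct frontend_counts c = { 1000000000ULL, 620000000ULL, 150000ULL, 800000000ULL };
    report_frontend_metrics(&c);
    return 0;
}

As Section B.2.1 cautions, neither ratio is meaningful in isolation: a small number of misses per instruction combined with a low deliver-mode percentage still indicates a front-end problem.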
