13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



USING PERFORMANCE MONITORING EVENTS

read and an RFO look like a data bus read, and are counted as such. Further distinction between programmatic reads and RFOs may be provided in future implementations.

Current implementations of the BSQ_cache_reference event can suffer from perceived over- or under-counting. References are based on BSQ allocations, as described above. Consequently, read misses are generally counted once per 128-byte line BSQ allocation (whether one or both sectors are referenced), but read and write (RFO) hits and most write (RFO) misses are counted once per 64-byte line, the size of a core reference. This makes the event counts for read misses appear to have a 2-times overcounting with respect to read and write (RFO) hits and write (RFO) misses. This granularity mismatch cannot always be corrected for, making it difficult to correlate to the number of programmatic misses and hits. If the user knows that both sectors in a 128-byte line are always referenced soon after each other, then the number of read misses can be multiplied by two to adjust miss counts to a 64-byte granularity.

Prefetches themselves are not counted as either hits or misses, as of Pentium 4 and Intel Xeon processors with a CPUID signature of 0xf21.
However, Pentium 4 processor implementations with a CPUID signature of 0xf07 and earlier have the problem that reads to lines that are already being prefetched are counted as hits in addition to misses, thus overcounting hits.

The number of “Reads Non-prefetch from the Processor” is a good approximation of the number of outermost cache misses due to loads or RFOs, for the writeback memory type.

B.2.4 Usage Notes on Bus Activities

A number of performance metrics in Table B-1 are based on IOQ_active_entries and BSQ_active_entries. The next three paragraphs provide information on various bus-transaction-underway metrics. These metrics nominally measure the end-to-end latency of transactions entering the BSQ (the aggregate sum of the allocation-to-deallocation durations for the BSQ entries used for all individual transactions in the processor). They can be divided by the corresponding number-of-transactions metrics (those that measure allocations) to approximate an average latency per transaction. However, that approximation can be significantly higher than the number of cycles it takes to get the first chunk of data for the demand fetch (load), because the entire transaction must be completed before deallocation. That latency includes deallocation overheads, and the time to get the other half of the 128-byte line, which is called an adjacent-sector prefetch. Since adjacent-sector prefetches have lower priority than demand fetches, there is a high probability on a heavily utilized system that the adjacent-sector prefetch will have to wait until the next bus arbitration cycle from that processor.
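The division described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name, the variable names, and the counter values are hypothetical and do not come from the manual or from any measurement.

```python
def average_bsq_latency(underway_cycles: int, allocations: int) -> float:
    """Approximate the average latency per transaction, in cycles.

    underway_cycles: value of a "transactions underway" metric, i.e. the
        aggregate allocation-to-deallocation duration of BSQ entries.
    allocations: value of the matching number-of-transactions metric.
    """
    if allocations <= 0:
        raise ValueError("allocations must be positive")
    return underway_cycles / allocations


# Hypothetical counter readings, chosen only to show the arithmetic.
underway_cycles = 1_200_000
allocations = 4_000
print(average_bsq_latency(underway_cycles, allocations))  # 300.0
```

The result covers the whole lifetime of each BSQ entry, including deallocation overhead and any adjacent-sector prefetch, so it should be read as an estimate that leans high rather than as the time to the first chunk of data.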
On current implementations, the granularities at which BSQ_allocation and BSQ_active_entries count can differ, leading to a possible 2-times overcounting of latencies for non-partial programmatic loads.

Users of the bus transaction underway metrics would be best served by employing them for relative comparisons across BSQ latencies of all transactions. Users that
