Intel® 64 and IA-32 Architectures Optimization Reference Manual

SUMMARY OF RULES AND SUGGESTIONS

Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors) (p. 7-21)

User/Source Coding Rule 24. (M impact, ML generality) Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. (p. 7-22)

User/Source Coding Rule 25. (H impact, L generality) Do not place any spin lock variable to span a cache line boundary. (p. 7-22)

User/Source Coding Rule 26. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth. (p. 7-24)

User/Source Coding Rule 27. (M impact, L generality) Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately. (p. 7-25)

User/Source Coding Rule 28. (M impact, M generality) Consider overlapping multiple back-to-back memory reads to improve effective cache miss latencies. (p. 7-26)

User/Source Coding Rule 29. (M impact, M generality) Consider adjusting the sequencing of memory references such that the distribution of distances of successive cache misses of the last level cache peaks towards 64 bytes. (p. 7-26)

User/Source Coding Rule 30. (M impact, M generality) Use full write transactions to achieve higher data throughput. (p. 7-26)

User/Source Coding Rule 31. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously. (p. 7-27)

User/Source Coding Rule 32. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, also minimize data sharing across bus domains. (p. 7-28)

User/Source Coding Rule 33. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread. (p. 7-30)

User/Source Coding Rule 34. (H impact, M generality) Adjust the private stack of each thread in an application so that the spacing between these stacks is not offset by multiples of 64 KBytes or 1 MByte to prevent unnecessary cache line evictions (when using Intel processors supporting HT Technology). (p. 7-31)

User/Source Coding Rule 35. (M impact, L generality) Add per-instance stack offset when two instances of the same application are executing in lock step to avoid memory accesses that are offset by multiples of 64 KByte or 1 MByte, when targeting Intel processors supporting HT Technology. (p. 7-32)

User/Source Coding Rule 36. (M impact, L generality) Avoid excessive loop unrolling to ensure the Trace cache is operating efficiently. (p. 7-34)

User/Source Coding Rule 37. (L impact, L generality) Optimize code size to improve locality of Trace cache and increase delivered trace length. (p. 7-34)

User/Source Coding Rule 38. (M impact, L generality) Consider using thread affinity to optimize sharing resources cooperatively in the same core and subscribing dedicated resource in separate processor cores. (p. 7-37)

User/Source Coding Rule 39. (M impact, L generality) If a single thread consumes half of the peak bandwidth of a specific execution unit (e.g. fdiv), consider adding a thread that rarely relies on that execution unit, when tuning for HT Technology. (p. 7-43)

E.3 TUNING SUGGESTIONS

Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a code page as instructions. This is very likely to happen when execution is following an indirect branch that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the data elsewhere, or inserting an illegal opcode or a pause instruction immediately after the indirect branch. Note that the latter two alternatives may degrade performance in some circumstances. (p. 3-63)

Tuning Suggestion 2. If a load is found to miss frequently, either insert a prefetch before it or (if issue bandwidth is a concern) move the load up to execute earlier. (p. 3-70)

Tuning Suggestion 3. Optimize single threaded code to maximize execution throughput first. (p. 7-41)

Tuning Suggestion 4. Optimize multithreaded applications to achieve optimal processor scaling with respect to the number of physical processors or processor cores. (p. 7-41)

Tuning Suggestion 5. Schedule threads that compete for the same execution resource to separate processor cores. (p. 7-41)

Tuning Suggestion 6. Use on-chip execution resources cooperatively if two logical processors are sharing the execution resources in the same processor core. (p. 7-42)

Tuning Suggestion 7.
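
As an illustration of User/Source Coding Rules 24 and 25 above, the following is a minimal C11 sketch, not code from the manual. It assumes a 128-byte separation (the sector size the summary gives for Pentium 4 and Intel Xeon processors). The names SYNC_SEPARATION, padded_lock_t, locks and spans_line are hypothetical and introduced only for this example. The alignment specifier pads each lock to its own 128-byte slot, and the helper checks whether an object would straddle a line boundary of a given size.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-byte separation, as suggested by Rule 24. */
#define SYNC_SEPARATION 128

/* Hypothetical lock type. Aligning the member to 128 bytes makes the
 * struct 128 bytes in size and alignment, so no two locks share a
 * cache line or sector (Rule 24), and a lock placed at the start of an
 * aligned slot cannot span a line boundary (Rule 25). */
typedef struct {
    alignas(SYNC_SEPARATION) volatile int lock;
} padded_lock_t;

/* Adjacent array elements are exactly SYNC_SEPARATION bytes apart. */
static padded_lock_t locks[2];

/* Returns 1 if an object of `size` bytes starting at `addr` crosses a
 * boundary between `line`-byte cache lines, 0 otherwise. */
static int spans_line(uintptr_t addr, size_t size, size_t line)
{
    return (addr / line) != ((addr + size - 1) / line);
}
```

A check such as spans_line((uintptr_t)&locks[0].lock, sizeof(int), 64) can be used in a debug build to confirm that a spin lock variable does not cross a line boundary. The 64-byte line size passed here is an assumption that should be adjusted for the target processor.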