13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



OPTIMIZING FOR SIMD INTEGER APPLICATIONS

Pentium II, Pentium III, and Pentium 4 processors may stall in such situations. See Chapter 3 for details.

5.7.1 Partial Memory Accesses

Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address MEM). The large load stalls in the case shown in Example 5-31.

Example 5-31. A Large Load after a Series of Small Stores (Penalty)

    mov  mem, eax        ; store dword to address "mem"
    mov  mem + 4, ebx    ; store dword to address "mem + 4"
    :
    :
    movq mm0, mem        ; load qword at address "mem", stalls

MOVQ must wait for the stores to write memory before it can access all data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory). When you change the code sequence as shown in Example 5-32, the processor can access the data without delay.

Example 5-32. Accessing Data Without Delay

    movd  mm1, ebx       ; build data into a qword first
                         ; before storing it to memory
    movd  mm2, eax
    psllq mm1, 32
    por   mm1, mm2
    movq  mem, mm1       ; store SIMD variable to "mem" as
                         ; a qword
    :
    :
    movq  mm0, mem       ; load qword SIMD "mem", no stall
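The two access patterns in Examples 5-31 and 5-32 can also be sketched in portable C. This is only an illustration, not code from the manual: the function names store_halves_then_load and build_qword_then_store are hypothetical, the buffer stands in for "mem", and a little-endian target (as on IA-32 and Intel 64) is assumed. A compiler is free to merge or reorder these accesses, so the generated instructions should be inspected before drawing performance conclusions.

```c
#include <stdint.h>
#include <string.h>

/* Penalty pattern (mirrors Example 5-31): two dword stores followed by
   one qword load that spans both of them. */
static uint64_t store_halves_then_load(unsigned char *mem,
                                       uint32_t lo, uint32_t hi)
{
    uint64_t q;
    memcpy(mem, &lo, sizeof lo);       /* store dword to "mem"      */
    memcpy(mem + 4, &hi, sizeof hi);   /* store dword to "mem + 4"  */
    memcpy(&q, mem, sizeof q);         /* load qword from "mem"     */
    return q;
}

/* Preferred pattern (mirrors Example 5-32): combine the two dwords in a
   register first, then issue a single qword store before the qword load. */
static uint64_t build_qword_then_store(unsigned char *mem,
                                       uint32_t lo, uint32_t hi)
{
    uint64_t q = ((uint64_t)hi << 32) | lo;  /* shift and OR, like psllq + por */
    uint64_t r;
    memcpy(mem, &q, sizeof q);               /* one qword store to "mem"  */
    memcpy(&r, mem, sizeof r);               /* qword load, same size     */
    return r;
}
```

Both functions leave the same bytes in memory on a little-endian machine. The difference is only in the sizes of the stores that precede the load, which is the property the section is concerned with.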
