13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

state that the rounding mode should be truncation. With the Pentium 4 processor, one can use the CVTTSD2SI and CVTTSS2SI instructions to convert operands with truncation without ever needing to change rounding modes. The cost savings of using these instructions over the methods below is enough to justify using SSE and SSE2 wherever possible when truncation is involved.

For x87 floating point, the FIST instruction uses the rounding mode represented in the floating-point control word (FCW). The rounding mode is generally "round to nearest", so many compiler writers implement a change in the rounding mode in the processor in order to conform to the C and FORTRAN standards. This implementation requires changing the control word on the processor using the FLDCW instruction. For a change in the rounding, precision, and infinity bits, use the FSTCW instruction to store the floating-point control word. Then use the FLDCW instruction to change the rounding mode to truncation.

In a typical code sequence that changes the rounding mode in the FCW, an FSTCW instruction is usually followed by a load operation. The load operation from memory should use a 16-bit operand to prevent a store-forwarding problem. If the load operation on the previously stored FCW word involves either an 8-bit or a 32-bit operand, this will cause a store-forwarding problem due to the mismatch in data size between the store operation and the load operation.

To avoid store-forwarding problems, make sure that the write and read to the FCW are both 16-bit operations.

If there is more than one change to the rounding, precision, and infinity bits, and the rounding mode is not important to the result, use the algorithm in Example 3-45 to avoid synchronization issues, the overhead of the FLDCW instruction, and having to change the rounding mode. Note that the example suffers from a store-forwarding problem which will lead to a performance penalty. However, its performance is still better than changing the rounding, precision, and infinity bits among more than two values.

Example 3-45. Algorithm to Avoid Changing Rounding Mode

_fto132 proc
    lea ecx, [esp-8]
    sub esp, 16              ; Allocate frame
    and ecx, -8              ; Align pointer on boundary of 8
    fld st(0)                ; Duplicate FPU stack top
    fistp qword ptr[ecx]
    fild qword ptr[ecx]
    mov edx, [ecx+4]         ; High DWORD of integer
    mov eax, [ecx]           ; Low DWORD of integer
    test eax, eax
    je integer_QnaN_or_zero
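
The listing for Example 3-45 breaks off at the page boundary, so its correction step is not visible here. As a rough illustration only, the C sketch below contrasts the two ideas this section describes: a direct truncating conversion (CVTTSD2SI through the `_mm_cvttsd_si32` intrinsic) and a conversion that rounds in whatever mode is current and then corrects the result toward zero. The function names, the use of CVTSD2SI as a stand-in for FISTP, the plain-C fallback, and the correction logic itself are assumptions made for this sketch; they are not taken from the manual's listing. Inputs are assumed to fit in a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Use the SSE2 intrinsics when the compiler targets SSE2; otherwise fall
   back to plain C so the sketch still builds on other targets. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

/* Truncating conversion: maps to CVTTSD2SI on SSE2 targets, so the
   rounding mode is never touched. */
static int32_t trunc_direct(double x)
{
#if SKETCH_HAS_SSE2
    return _mm_cvttsd_si32(_mm_set_sd(x));
#else
    return (int32_t)x; /* a C cast truncates toward zero */
#endif
}

/* Rounding conversion, used here as a stand-in for FISTP (assumption):
   on SSE2 targets CVTSD2SI honors the current rounding mode, which is
   round to nearest by default. */
static int32_t round_current_mode(double x)
{
#if SKETCH_HAS_SSE2
    return _mm_cvtsd_si32(_mm_set_sd(x));
#else
    return (int32_t)(x + (x >= 0.0 ? 0.5 : -0.5)); /* nearest, ties away */
#endif
}

/* Hypothetical correction step: if rounding moved the value away from
   zero, step the integer back by one. */
static int32_t trunc_by_correction(double x)
{
    int32_t r = round_current_mode(x);
    double diff = x - (double)r;        /* sign shows which way we rounded */
    if (x > 0.0 && diff < 0.0) r -= 1;  /* rounded up past a positive x */
    if (x < 0.0 && diff > 0.0) r += 1;  /* rounded down past a negative x */
    return r;
}

/* Both routes should agree on every in-range input. */
static void self_check(void)
{
    const double samples[] = { 2.7, -2.7, 2.5, -2.5, 0.4, -0.4, 3.0, -3.0, 0.0 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(trunc_direct(samples[i]) == trunc_by_correction(samples[i]));
}
```

Calling `self_check()` compares the two routes on a handful of values. The correction only relies on the rounded integer being within one of the input, which is why this style of algorithm fits the case where the rounding mode is not important to the result.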
