
Intel® Architecture Instruction Set Extensions Programming Reference


APPLICATION PROGRAMMING MODEL

2.8.1 Clearing Upper YMM State Between AVX and Legacy SSE Instructions

There is no transition penalty if an application clears the upper bits of all YMM registers (sets them to '0') via VZEROUPPER or VZEROALL before transitioning between AVX instructions and legacy SSE instructions. Note: clearing the upper state via sequences of XORPS, or by loading '0' values individually, may be useful for breaking dependencies, but will not avoid state transition penalties.

Example 1: an application using 256-bit AVX instructions makes calls to a library written using legacy SSE instructions. The application would encounter a delay upon executing the first legacy SSE instruction in that library, and then again (after exiting the library) upon executing the first AVX instruction. To eliminate both of these delays, the application should execute VZEROUPPER prior to entering the legacy library and, after exiting the library, before executing in a 256-bit AVX code path.

Example 2: a library using 256-bit AVX instructions is intended to support applications that use legacy SSE instructions. Such a library function should execute VZEROUPPER after its VEX-encoded instructions, issuing it at the end of the function before it returns to the calling application. This prevents the calling application from experiencing a delay when it starts to execute legacy SSE code.

2.8.2 Using AVX 128-bit Instructions Instead of Legacy SSE Instructions

Applications using AVX and FMA should migrate legacy 128-bit SIMD instructions to their 128-bit AVX equivalents. AVX supplies the full complement of 128-bit SIMD instructions except for AES and PCLMULQDQ.

2.8.3 Unaligned Memory Access and Buffer Size Management

The majority of AVX instructions support loading 16/32 bytes from memory without alignment restrictions. (A number of non-VEX-encoded SIMD instructions also do not require 16-byte address alignment, e.g. MOVDQU, MOVUPS, MOVUPD, LDDQU, PCMPESTRI, PCMPESTRM, PCMPISTRI and PCMPISTRM.) A buffer size management issue related to unaligned SIMD memory access is discussed here.

The size requirements for memory buffer allocation should consider unaligned SIMD memory semantics and application usage. Frequently a caller function passes an address pointer in conjunction with a length parameter. From the caller's perspective, the length parameter usually corresponds to the limit of the allocated memory buffer range, or it may correspond to certain application-specific configuration parameters that have an indirect relationship with the valid buffer size.

For certain types of application usage, it may be desirable to distinguish between the valid buffer range limit and other application-specific parameters related to memory access patterns; examples of the latter may be stride distance, frame dimensions, etc. There may be situations in which a callee wishes to load 16 bytes of data with part of the 16 bytes lying outside the valid memory buffer region, in order to take advantage of the efficiency of SIMD load bandwidth, and then discard the invalid data elements outside the buffer boundary. An example of this may be in video processing of frames whose dimensions are not multiples of 16 bytes.

Allocating buffers without regard to the use of the subsequent 16/32 bytes can lead to a rare occurrence of an access rights violation, as described below:

• A present page in the linear address space being used by ring 3 code is followed by a page owned by ring 0 code.

• A caller routine allocates a memory buffer without adding extra pad space and passes the buffer address to a callee routine.

• The callee routine implements an iterative processing algorithm by advancing an address pointer relative to the buffer address, using SIMD instructions with unaligned 16/32-byte load semantics.

• The callee routine may choose to load 16/32 bytes near the buffer boundary with the intent to discard the invalid data outside the data buffer allocated by the caller.

• If the valid data buffer extends to the end of the present page, unaligned 16/32-byte loads near the end of the page may spill over onto the subsequent ring-0 page and cause a #GP.

2-26 Ref. # 319433-014
