03.08.2013 Views

Decorated Operations for QorIQ P3/P4/P5 Processors - Freescale ...



case where seven cores tried to update the same shared variable multiple times and hence also had to wait for the lock to be freed, the update took on average 95 times longer than the private variable. See Section 3, “Using Decorations in Software and Performance Results,” for further details on the benchmark.

Freescale’s implementation of decorated operations goes back to the initial, most basic problem, namely, how to share data in a multicore environment. Freescale’s decorated operations solve this problem by means that do not introduce sequential code and that have low software overhead and complexity. This is done by using other parts of the SoC besides the cores to perform relative operations on data, such as “increase x by 10.” By moving the operation out of the core and into a central location, it is possible to guarantee atomic operations and a simpler inter-core order of execution.
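The idea of handing a relative operation to a central location can be sketched as a small software model in C. It is only a model under stated assumptions: `relative_op`, `central_unit`, and `central_apply` are hypothetical names, and the array of requests stands in for operations arriving from several cores. It does not describe how the silicon is built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A relative operation as a core would issue it: "increase x by delta".
 * Hypothetical type, for illustration only. */
typedef struct {
    int core_id;    /* issuing core, kept only for illustration */
    int32_t delta;  /* signed amount to add */
} relative_op;

/* Model of the central location: it owns the variable and applies
 * requests one at a time, so no update can be lost. */
typedef struct {
    int32_t x;
} central_unit;

void central_apply(central_unit *u, const relative_op *ops, size_t n)
{
    assert(u != NULL && (ops != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        /* The read-modify-write happens here, not in any core. */
        u->x += ops[i].delta;
    }
}
```

Because each core only states the change it wants, no core ever reads a stale copy of `x`, and the core-side code needs no lock.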

We will discuss how decorated operations are implemented in Freescale’s high-end QorIQ processor family, with the P3, P4, and P5 devices, and how software can use these operations to reach performance numbers equal to those of operations on private variables. Use cases and examples are taken mainly from the P4080 but are generally applicable to all the P3, P4, and P5 devices.

2 Silicon Implementation of Decorated Operations

Decorated operations, or decorated storage, as it is also commonly named, are a set of core instructions added to the Power Architecture instruction set [6]. The instructions “decorate” the common load/store instructions with a computation and an attribute. For example, stbx (Store Byte Indexed) now also has a decorated version: stbdx (Store Byte Decorated Indexed). The same applies to the half-word, word, double-word, and double-float versions of the store and load instructions (that is, stbdx, sthdx, stwdx, stddx¹, and stfddx, and lbdx, lhdx, lwdx, lddx², and lfddx). An additional dsn (Notify) instruction has also been added. It has no corresponding load/store version; the core interprets it as a nop (No Operation), but it carries a decoration.

This decoration does not have any direct meaning to the core itself; depending on the SoC implementation, it is interpreted by other parts of the device. In the case of Freescale’s QorIQ processor family, these decorations are interpreted by the CPC, which carries out the operations together with the CoreNet DDR queue and DDR controller (see Figure 3). These act similarly to transactional memory [7] to perform operations on a global scale outside the cores. Unlike transactional memory, there is no need to handle rollbacks, because CoreNet buffer transactions are required and ensure the correct order of execution. The decorations for load instructions include clear, set, decrement, and increment of data. Those for store instructions include accumulate (the operand can be negative), combined increment and accumulate, maximum threshold, and minimum threshold. The notify instruction can carry increment as well as clear operations. Versions are available for signed and unsigned data and for 32- and 64-bit word lengths.

1. Declared with “volatile” to ensure that no unfair compiler optimizations were used, and that a full read-update-write cycle was executed.

2. Not implemented on P3/P4, but available on P5.

Decorated Operations for QorIQ P3/P4/P5 Processors, Rev. A

Freescale Confidential Proprietary

Freescale Semiconductor

Preliminary—Subject to Change Without Notice
