Decorated Operations for QorIQ P3/P4/P5 Processors - Freescale ...
the need for the final operation to be executed in the CPC, that is, “Fire and Forget.” They are therefore
executed in one cycle or fewer (see footnote 1). Performance is comparatively high relative to lock-based approaches
(see Section 3, “Using Decorations in Software and Performance Results”).
3 Using Decorations in Software and Performance Results
Decorated operations are typically implemented with macros to simplify usage; alternatively, they could
overload the basic add/subtract functions in a programming language that supports overloading, such as C++. In the following
benchmark case, the operation is implemented in bare-metal code directly on the P4080 without an underlying
operating system. Seven of the eight cores run bare-metal code, whereas the last core runs Linux
to simplify the boot process. However, the operating system configuration and the level at which the decorated
operations are implemented are not important, because they execute at the same privilege level and
have the same characteristics for the core as normal load/store operations. Tests are run in both
single-core and multicore configurations.
Data accessed by a decorated operation must be set up in three steps. First, a pointer to the data must
be defined, as follows:
volatile int32_t *decorated_counter = NULL;<br />
In the program code, allocate the data and set the value to a default state, in this case zero, as follows:<br />
decorated_counter = (int32_t *) stats_memalign(CACHE_LINE_SIZE,
                                               sizeof(int32_t));
*decorated_counter = 0;
Finally, the code makes use of the data by executing a decorated operation, as follows:<br />
decorated_notify_inc_32(decorated_counter);<br />
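The three steps above can be tried off-target with a host-side stand-in. The following is a minimal sketch, not the implementation used in the benchmark: the source does not show the bodies of `stats_memalign` or `decorated_notify_inc_32`, so the names `emu_memalign`, `emu_decorated_notify_inc_32`, and `emu_counter_create` are hypothetical, the increment is emulated with a C11 relaxed atomic add rather than a real decorated store, and the `CACHE_LINE_SIZE` value of 64 is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumed value for this sketch */

/* Hypothetical stand-in for stats_memalign(): returns memory aligned to
 * 'alignment' bytes. aligned_alloc() requires the size to be a multiple
 * of the alignment, so the request is rounded up. */
static void *emu_memalign(size_t alignment, size_t size)
{
    size_t padded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, padded);
}

/* Hypothetical host emulation of decorated_notify_inc_32(): the caller
 * issues the increment and does not use a return value. On the target this
 * would be a decorated operation; here a relaxed atomic add stands in. */
static void emu_decorated_notify_inc_32(_Atomic int32_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Steps 1 and 2: define the pointer, allocate one cache-line-aligned
 * counter, and set it to its default state of zero. */
static _Atomic int32_t *emu_counter_create(void)
{
    _Atomic int32_t *counter =
        emu_memalign(CACHE_LINE_SIZE, sizeof(_Atomic int32_t));
    assert(counter != NULL);
    atomic_init(counter, 0);
    return counter;
}
```

Step 3 is then a call such as `emu_decorated_notify_inc_32(counter);` wherever the event being counted occurs. Giving the counter its own cache line in this sketch keeps it from sharing a line with unrelated data.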
The typical use case for decorated operations is an update to a data structure that occurs relatively seldom,
roughly less often than once every hundred cycles. In this case, an update is executed in a single cycle, which
is the same as for private data. A lock-based update takes roughly 35 cycles in the
ideal single-core case. These figures were measured by reading the clock cycle timer, running the test,
reading the cycle timer again, and then subtracting the measured overhead of reading the timers. The overhead
is a stable 4 clock cycles:
atb_start = mfspr(SPR_ATBL); //start timer<br />
decorated_notify_inc_32(decorated_counter);<br />
atb_stop = mfspr(SPR_ATBL); //stop timer<br />
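The measurement procedure described above (read the timer, run the operation, read the timer again, subtract the timer-read overhead) can be sketched in portable C as follows. This is an illustration under stated assumptions, not the benchmark harness: `timer_overhead`, `measure_cycles`, `fake_timer_read`, and `fake_op` are hypothetical names, and the fake timer's 4-tick read cost and 1-tick operation are illustrative values chosen to mirror the figures quoted in the text, not measurements. On the target, the timer callback would wrap the `mfspr(SPR_ATBL)` read shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*timer_read_fn)(void);
typedef void (*op_fn)(void *arg);

/* Calibration: two back-to-back timer reads give the cost of reading
 * the timer itself. */
static uint32_t timer_overhead(timer_read_fn read_timer)
{
    assert(read_timer != NULL);
    uint32_t start = read_timer();
    uint32_t stop = read_timer();
    return stop - start;
}

/* Time one call of op(arg) and subtract the calibrated overhead.
 * Unsigned subtraction keeps the result correct if the 32-bit timer
 * wraps once between the two reads. */
static uint32_t measure_cycles(timer_read_fn read_timer, op_fn op,
                               void *arg, uint32_t overhead)
{
    assert(read_timer != NULL && op != NULL);
    uint32_t start = read_timer();
    op(arg);
    uint32_t stop = read_timer();
    uint32_t elapsed = stop - start;
    return elapsed > overhead ? elapsed - overhead : 0;
}

/* Host-side fakes (illustrative values only): each timer read costs
 * 4 ticks and the operation under test costs 1 tick. */
static uint32_t fake_ticks;

static uint32_t fake_timer_read(void)
{
    fake_ticks += 4;
    return fake_ticks;
}

static void fake_op(void *arg)
{
    (void)arg;
    fake_ticks += 1;
}
```

Calibrating the overhead once and passing it in keeps the timed region limited to the two timer reads and the operation itself, as in the three-line sequence above.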
Because locks use an SoC-wide atomic function, they are affected by other locks. For example, when one
core runs the code above and the other cores wait at a different lock, the cycle count increases from
roughly 35 cycles to about 200 cycles. When all cores operate on the same lock, the cycle
count increases further. A synthetic use case that is not typically found in real applications, but is of general interest
because of the heavy load it puts on the system, is to run a long loop of updates. This also allows for
1. The e500mc core is superscalar and can load and retire up to two instructions per cycle under certain conditions.<br />
Decorated Operations for QorIQ P3/P4/P5 Processors, Rev. A
Freescale Confidential Proprietary, Freescale Semiconductor
Preliminary—Subject to Change Without Notice