
Freescale Semiconductor 

Application Note 

Decorated Operations for QorIQ P3/P4/P5 Processors

by Networking and Multimedia Group 

Freescale Semiconductor, Inc. 

Austin, TX 

This application note presents the concept of “statistics 

acceleration” implemented with the usage of decorated 

operations. 

The important tasks of statistics gathering and logging of 

ongoing activity within an embedded system often consume

a substantial amount of cycles. This can be seen in 

applications such as computer vision and vehicle control, as 

well as network and telecom infrastructure. In the latter case, 

individual flows of data must be tracked, and this 

information is then used to find errors and tune the network 

to optimal performance. Although this is an important 

functionality, it is essential that statistics handling take as 

little time as possible. 

With the current multicore trend, time-efficient statistics handling is becoming more difficult.
Data structures holding statistics or other key parameters are shared between the cores, so locks
must be put around them to prevent race conditions. Locks in turn can cause well-known problems,
such as dead-locks, live-locks, and priority inversion, and handling those adds yet another layer
of complexity.

© 2010 Freescale Semiconductor, Inc. All rights reserved. 

Freescale Confidential Proprietary 

Preliminary—Subject to Change Without Notice 

Document Number: AN4181 

Rev. A, 07/2010 

Contents

1. Introduction
2. Silicon Implementation of Decorated Operations
3. Using Decorations in Software and Performance Results
4. Implementation Details
5. Sample Application
6. Decorated Macro Functions
7. Summary
8. References
9. Revision History


Statistics acceleration with the use of decorated operations solves the initial problem without the need for 

locks. This drastically increases performance and lowers the software complexity, because all protection 

for issues related to locks can be removed. 

1 Introduction 

Currently, the industry is rapidly migrating to multicore solutions, largely because easy
single-core performance improvements from stronger or faster cores are coming to an end. Further
improvements to code execution (instructions per cycle) require drastically more complex logic.
Increasing the frequency is difficult, because power consumption increases with the square of
the frequency¹. Furthermore, higher frequency gives little additional performance due to the
core/memory speed difference [1].

Multicore solutions give a theoretically higher performance with low aggregate power consumption, but 

it is crucial that the hardware and software are designed to allow for efficient scaling. The most important

hardware aspects are buses and memory interfaces; in this area, the concept of switch fabrics is replacing 

traditional buses. For example, Freescale’s P4080 communication processor, equipped with eight e500mc 

Power Architecture® cores, solves this problem by utilizing the CoreNet coherency fabric with nearly 1 

Tbps of internal memory bandwidth and dual DDR3 interfaces. The other aspect, software design, is 

difficult to solve on a general basis to allow efficient scaling. Amdahl’s law [2] describes the application 

speed-up relative to the number of cores and how well parallelized the software is. As shown in Figure 1, 

software that is largely sequential can never make efficient use of highly parallel architectures. Therefore, 

it is critical to provide means to remove sequential sections. 

Figure 1 shows Amdahl’s law of scaling over multiple cores for different degrees of parallelized code. S 

marks the portion of sequential code. 

Figure 1. Amdahl’s Law 
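Amdahl's law is commonly written as speedup(N) = 1 / (S + (1 - S) / N), where S is the sequential fraction and N is the number of cores. The following is a minimal sketch that evaluates this formula; the function names `amdahl_speedup` and `print_amdahl_table` are illustrative and are not part of the benchmark code in this application note.

```c
#include <stdio.h>

/* Amdahl's law: theoretical speed-up on n cores when a fraction s
 * (0.0 to 1.0) of the work is strictly sequential. */
static double amdahl_speedup(double s, unsigned n)
{
    return 1.0 / (s + (1.0 - s) / (double)n);
}

/* Print the speed-up for a few sequential fractions on 1, 2, 4 and 8 cores. */
static void print_amdahl_table(void)
{
    const double fractions[] = { 0.0, 0.1, 0.25, 0.5 };

    for (size_t f = 0; f < sizeof fractions / sizeof fractions[0]; f++) {
        printf("S = %.2f:", fractions[f]);
        for (unsigned n = 1; n <= 8; n *= 2)
            printf("  %u cores -> %.2fx", n, amdahl_speedup(fractions[f], n));
        printf("\n");
    }
}
```

With half of the code sequential (S = 0.5), even eight cores yield less than a 2x speed-up, which is why removing sequential sections matters more than adding cores.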

1. The relation is P = CV²F, but a higher frequency also requires a higher voltage and leakier processes.


Single-core applications typically work by reading data from a data structure, computing a result, and then 

writing that back to the data structure. When this application is parallelized, the same computation is done, 

but it is now important to protect the data structure so that it is not incorrectly updated due to the 

concurrency. 

Take bank software as an example. Mr. Foo's account currently holds $1000 and is accessed by two
processes at nearly the same time. One deposits $500 and the other deducts $20. Both processes
read the current balance, add or remove money, and finally write the result back to the account.
If both reads take place before either write, the final result is whatever the last write stored,
and the other operation is lost. Mr. Foo is left with either $1500 or only $980 in his account,
instead of the correct $1480.

Figure 2 shows Mr. Foo’s bank account. The final value of the account is non-deterministic. 

Figure 2. Mr. Foo’s Bank Account 
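The lost update can be reproduced deterministically by writing out the interleaving by hand. The sketch below is a single-threaded model of the two processes, not real concurrent code; the function name `racy_update` and its parameters are assumptions made for illustration.

```c
#include <stdint.h>

/* Model two processes that both read the balance before either writes.
 * 'deposit_writes_last' selects which of the two writes lands last. */
static int64_t racy_update(int64_t balance, int64_t deposit,
                           int64_t withdrawal, int deposit_writes_last)
{
    int64_t seen_by_a = balance;              /* process A reads the balance */
    int64_t seen_by_b = balance;              /* process B reads the same value */
    int64_t result_a = seen_by_a + deposit;   /* A computes its new balance */
    int64_t result_b = seen_by_b - withdrawal;/* B computes its new balance */

    if (deposit_writes_last) {
        balance = result_b;                   /* B writes first ... */
        balance = result_a;                   /* ... and A overwrites it */
    } else {
        balance = result_a;                   /* A writes first ... */
        balance = result_b;                   /* ... and B overwrites it */
    }
    return balance;
}
```

Calling `racy_update(1000, 500, 20, 1)` returns 1500 and `racy_update(1000, 500, 20, 0)` returns 980; neither interleaving produces 1480.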

This type of issue is called a race condition, and its traditional solution is to introduce a lock on the data 

structure. However, locks are sequential by nature and lower the degree of parallelism. A lock is typically 

implemented as follows: 

Test/Set to get lock for structure
If (lock was already used by other) then try above again
Read and update structure
Release lock
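The test-and-set pattern above can be sketched in portable C11 with `atomic_flag`. This is only an illustration of the technique; it is not the LWE `spin_lock` implementation, and the names `spinlock_sketch_t`, `sketch_lock`, `sketch_unlock`, and `locked_increment` are hypothetical.

```c
#include <stdatomic.h>

/* A minimal test-and-set spinlock (illustrative, not the LWE lock). */
typedef struct { atomic_flag held; } spinlock_sketch_t;
#define SPINLOCK_SKETCH_INIT { ATOMIC_FLAG_INIT }

static void sketch_lock(spinlock_sketch_t *l)
{
    /* Test/Set: returns the previous state, so retry while it was already set. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void sketch_unlock(spinlock_sketch_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static long shared_counter;
static spinlock_sketch_t counter_lock = SPINLOCK_SKETCH_INIT;

/* Read and update the structure while holding the lock, then release it. */
static void locked_increment(void)
{
    sketch_lock(&counter_lock);
    shared_counter++;
    sketch_unlock(&counter_lock);
}
```

Every update passes through the same flag, so updates from different cores are serialized at this point regardless of how many cores are available.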

Locks also introduce additional software complexity, because priority inversion [3] and dead-lock
situations [4] must be handled if they occur. This in turn can lead to live-locks [5]. To
conclude, the traditional method of handling synchronization by using locks is not robust and does
not scale efficiently. It is also very costly in terms of cycles. In a benchmark running
bare-board on the P4080 and utilizing Freescale's light-weight executive (LWE) library, a shared
variable protected by locks took nearly 25 times as long to update as a private variable¹ due to
the lock overhead alone. In the case where seven cores tried to update the same shared variable
multiple times and hence also had to wait for the lock to be freed, the update took on average 95
times longer than the private variable. See Section 3, “Using Decorations in Software and
Performance Results,” for further details on the benchmark.

1. Declared with “volatile” to ensure that no unfair compiler optimizations were used, and that a
full read-update-write cycle was executed.

Freescale’s implementation of decorated operations goes back to the initial, most basic problem, namely, 

how to share data in a multicore environment. Freescale’s decorated operations solve this problem by 

means that do not introduce sequential code and have a low software overhead and complexity. This is 

done by using other parts of the SoC besides the cores to perform relative operations on data, such as 

“increase x by 10.” By moving the operation out from the core into a central location, it is possible to 

guarantee atomic operations and simpler inter-core order of execution. 

This application note discusses how decorated operations are implemented in Freescale’s high-end
QorIQ processor family (the P3, P4, and P5 devices) and how software can use these operations to
reach performance numbers equal to those of operations on private variables. Use cases and
examples are mainly taken from the P4080 but are generally applicable to all P3, P4, and P5 devices.

2 Silicon Implementation of Decorated Operations 

Decorated operations, or decorated storage, as it is also commonly named, is a set of core instructions 

added to the Power Architecture instruction set [6]. The instructions are “decorated” with a computation 

and attribute to the common load/store instruction. For example, the stbx (Store Byte Indexed) now also 

has a decorated version: stbdx (Store Byte Decorated Indexed). This is also applied to half-word, word, 

double-word, and double-float versions of the store as well as load instructions (that is, stbdx, sthdx, 

stwdx, stddx 1 , and stfddx, and lbdx, lhdx, lwdx, lddx 2 , and lfddx). An additional dsn (Notify) instruction 

has also been added that does not have any corresponding load/store version, but is interpreted as a nop

(No Operation) by the core and carries a decoration.

This decoration does not have any direct meaning to the core itself, but depending on the SoC 

implementation, it is interpreted by other parts of the device. In the case of Freescale’s QorIQ processor 

family, these decorations are interpreted by the CPC, which carries out the operations together with the 

CoreNet DDR queue and DDR controller (see Figure 3). These act similarly to transactional memory [7] 

to perform operations on a global scale outside the cores. Unlike transactional memory, there is no need to 

handle rollbacks, because CoreNet buffer transactions are required and ensure the correct order of 

execution. The decorations for load instructions include clear, set, decrement, and increment of data. Store 

instructions include accumulate (could be negative), combined increment and accumulate, maximum 

threshold, and minimum threshold. The notify instruction can carry increment as well as clear operations. 

Versions are available for signed and unsigned data, and for both 32- and 64-bit word lengths.

1. Declared with “volatile” to ensure that no unfair compiler optimizations were used, and that a full read-update-write cycle was 

executed. 

2. Not implemented on P3/P4, but available on P5. 






Figure 3 shows how the CoreNet platform cache and DDR controller interface. Decorated operations are 

implemented in the core as instructions, and the decorations are interpreted by the CoreNet platform

cache (CPC), which carries out the operations together with the CoreNet DDR queue and DDR controller.

Figure 3. Decorated Operations, CoreNet Cache, and DDR 

A decorated operation carries four parameters: type of access (such as load, store, or notify), data address 

to operate on, data to use, and the decorated value that defines the operation. 

As an example, consider the following decorated operation: add 10 (an integer) to the variable bar
of type long (64-bit), located in memory at address A (double-word aligned). The access type is
Store Word Decorated (stwdx), the data address is set to A + 4 to force right justification of the
store data within the accumulator, and the decoration type is set to Accumulate 64-bit. The
following C code corresponds to this operation:

decorated_store_64_acc_64(&bar,10); 

Going back to Mr. Foo’s bank account, this type of relative change is the perfect match for decorated 

operations. The two processes that work on the bank account do not need to use any locks but can simply 

execute the following, respective, instructions: 

decorated_store_64_acc_64(&account_foo, -20); 

decorated_store_64_acc_64(&account_foo, 500); 
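On a host without decorated storage, the closest portable analogue of a relative update is a C11 atomic fetch-and-add. The sketch below uses it as a stand-in to show that the two updates give the same result in either order. It does not model the CPC or the `stwdx` instruction, and the names `emulated_acc_64` and `apply_in_order` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Host-side stand-in for the accumulate decoration: add a signed delta
 * to the target as one indivisible read-modify-write. */
static void emulated_acc_64(_Atomic int64_t *target, int64_t delta)
{
    atomic_fetch_add_explicit(target, delta, memory_order_relaxed);
}

/* Apply the deposit and the withdrawal in the selected order and
 * return the resulting balance. */
static int64_t apply_in_order(int deposit_first)
{
    _Atomic int64_t account = 1000;

    if (deposit_first) {
        emulated_acc_64(&account, 500);
        emulated_acc_64(&account, -20);
    } else {
        emulated_acc_64(&account, -20);
        emulated_acc_64(&account, 500);
    }
    return atomic_load(&account);
}
```

Both `apply_in_order(1)` and `apply_in_order(0)` return 1480, because each update is expressed relative to the current value rather than as an absolute value computed from an earlier read.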

Note that the order of execution is not important; the change is relative to the current value. This works 

well with statistics and data logging, such as keeping track of how much data and packets a specific user 

has sent in a network, the distance a car has travelled, progress measurement, and so on. A specific change 

such as updating the MAC address in an ARP table, or changing Mr. Foo’s account to be owned by

someone else, does not work well. Such abrupt changes require a larger level of synchronization between 

the processes to ensure that there are no pending transactions. 

Furthermore, the data that is operated on must be marked as cache-inhibited to not be cached by private 

L1 and L2 caches. It must also be marked as guarded so that there are no speculative loads causing 

undesired effects. The operation is carried out in the L3 platform cache and data either remains there for 

the time being, or alternatively, is brought in from DDR, updated, and directly put into the DDR write 

queue without altering the cache. The store and notify instructions are carried out directly by
the core without waiting for the final operation to be executed in the CPC, that is, “Fire and
Forget.” They are therefore executed in one cycle or fewer¹. The performance is high relative to
lock-based approaches (see Section 3, “Using Decorations in Software and Performance Results”).

3 Using Decorations in Software and Performance 

Results 

Decorated operations are typically implemented with macros to simplify usage; alternatively, they could 

overload basic add/subtract functions in an applicable programming language such as C++. In the following

benchmark case, the operation is implemented in bare-metal directly on the P4080 without an underlying 

operating system. Seven of the eight cores are running bare-metal, whereas the last core is running Linux 

to simplify the boot process. However, the operating system configuration and at what level the decorated 

operations are implemented are not important, because they are executed on the same privilege level and 

have the same characteristics for the core as normal load/store operations. Tests are made both in 

single-core as well as multicore configurations. 

Data accessed by a decorated operation requires three steps. First, a pointer to the data must

be defined, as follows: 

volatile int32_t *decorated_counter = NULL; 

In the program code, allocate the data and set the value to a default state, in this case zero, as follows: 

decorated_counter=(int32_t *) stats_memalign(CACHE_LINE_SIZE, 

sizeof(int32_t)); 

*decorated_counter = 0; 

Finally, the code makes use of the data by executing a decorated operation, as follows: 

decorated_notify_inc_32(decorated_counter); 
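The same three steps (declare, allocate and initialize, update) can be tried on a host system with the sketch below. It is an assumption-laden stand-in: `aligned_alloc` replaces `stats_memalign`, a 64-byte cache line is assumed, and a C11 atomic add replaces the `dsn` notify instruction. The cache-inhibited and guarded page attributes that real decorated operations need cannot be expressed here, and all `sketch_` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_CACHE_LINE_SIZE 64   /* assumed line size, not taken from the P4080 headers */

/* Step 1: a pointer to the shared counter. */
static _Atomic int32_t *sketch_counter = NULL;

/* Step 2: allocate one cache-line-aligned line and set the counter to zero.
 * Returns 0 on success and -1 if the allocation fails. */
static int sketch_counter_init(void)
{
    sketch_counter = aligned_alloc(SKETCH_CACHE_LINE_SIZE, SKETCH_CACHE_LINE_SIZE);
    if (sketch_counter == NULL)
        return -1;
    atomic_init(sketch_counter, 0);
    return 0;
}

/* Step 3: host-side stand-in for decorated_notify_inc_32(). */
static void sketch_notify_inc_32(_Atomic int32_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}
```

Giving the counter a cache line of its own keeps unrelated data from sharing the line that the updates target.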

The typical use case for decorated operations is a data structure that is updated relatively
seldom, roughly less than once every hundred cycles. In this case, an update is executed in a
single cycle, the same as for private data. For a lock-based update, the programmer gets roughly
35 cycles in the ideal single-core case. These tests were measured by reading the clock cycle
timer, running the test, reading the cycle timer again, and then subtracting the measured overhead
of reading the timers. The overhead is a stable 4 clock cycles:

atb_start = mfspr(SPR_ATBL); //start timer
/* code under test */
atb_stop = mfspr(SPR_ATBL); //stop timer

1. The e500mc core is superscalar and can load and retire up to two instructions per cycle under
certain conditions.

Because locks use an SoC-wide atomic function, they are affected by other locks. For example, when
one core runs the code (above) and the other cores wait at a different lock, the cycle count
increases from roughly 35 cycles to about 200 cycles. When all cores operate on the same lock,
there is an additional cycle count increase. A synthetic use case that is not typically found in
real applications, but is of general interest due to the extensive load it puts on the system, is
to run a long loop of updates. This also allows for measuring the penalty due to multicore access
to the same data. Below is an example of the code that is used, this time showing a lock-based
access. Each core runs the loop 10,000 times:


for(i=0; i < 10000; i++){
    spin_lock(&sync_lock);
    lock_counter++;
    spin_unlock(&sync_lock);
}


In the case of lock-based accesses with eight cores running in parallel, the average access time
increases to 848 cycles due to the delay at the lock. Note that the standard deviation is very
large in this case, nearly 50% of the average cycle count, so the access time is highly
nondeterministic. For decorated operations, the CPC is expected to become a bottleneck, because it
is not designed to handle such a large flow of consecutive operations. The CPC runs on the SoC
clock rather than the core clock and can execute one decorated operation every second clock cycle.
With the core clock running 2.4 times faster than the SoC clock and seven cores executing
decorated operations, 33 core clock cycles per iteration are expected. The benchmark confirms
this, and the standard deviation is now only 8% of the cycle count.
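The expected figure follows from simple arithmetic if the CPC is the only bottleneck: each of the seven cores must wait for seven operations per iteration, each operation occupies the CPC for two SoC cycles, and one SoC cycle is assumed to last 2.4 core cycles. The helper below is a sketch under exactly those assumptions; the function name and parameter names are illustrative.

```c
/* Expected core clock cycles per loop iteration when 'cores' cores each
 * issue one decorated operation per iteration and the CPC completes one
 * operation every 'soc_cycles_per_op' SoC cycles.
 * 'core_cycles_per_soc_cycle' is the core-to-SoC clock ratio. */
static double expected_core_cycles(unsigned cores,
                                   double soc_cycles_per_op,
                                   double core_cycles_per_soc_cycle)
{
    return (double)cores * soc_cycles_per_op * core_cycles_per_soc_cycle;
}
```

`expected_core_cycles(7, 2.0, 2.4)` evaluates to 7 x 2 x 2.4 = 33.6, in line with the roughly 33 core cycles per iteration quoted above.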

Freescale implemented an application-typical test (Network Address and Port Translation, NAPT) to
measure the total impact of running with locks as well as with decorated operations. The core
fetched an incoming UDP/IP packet from the network, did a look-up for address translation, changed
the destination port and address, updated statistics, and sent out the packet. The interesting
part in this case is the statistics update and the time it consumed.

Without any statistics, but using the P4080 packet processing accelerators, the cycle count per
packet was measured to be 440 cycles with a standard deviation of 18 cycles. A global total packet
counter and total byte counter were then added, as well as individual flow-based counters for the
number of packets and number of bytes transferred. With a single lock protecting the statistics,
the average packet processing time increased to 686 ± 18 cycles. With decorated operations, the
statistics updates could be scheduled to optimize performance, and the total cycle count increased
only to 442 ± 19 cycles per packet.
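To put the NAPT numbers in relative terms, the statistics overhead can be expressed as a percentage of the 440-cycle baseline. The helper below is a small illustrative calculation; its name is not from the original benchmark.

```c
/* Relative overhead, in percent, of a measured cycle count over a baseline. */
static double overhead_percent(double baseline_cycles, double measured_cycles)
{
    return 100.0 * (measured_cycles - baseline_cycles) / baseline_cycles;
}
```

`overhead_percent(440, 686)` is about 56% for the lock-based statistics, whereas `overhead_percent(440, 442)` is about 0.5% for decorated operations, which is well inside the measured standard deviation.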

The conclusion from the tests is that decorated operations allow for a significant performance increase 

compared to lock-based implementations. 

4 Implementation Details 

Decorated storage operations operate only on addresses that have been marked as Caching Inhibited 1 , that 

is, non-cacheable. Performing a decorated storage operation to addresses that are cacheable causes the 

operation to degrade to the equivalent non-decorated load or store operation: lbdx into lbx, stwdx into 

stwx, and notify into nop. 

1. Caching-inhibited: All loads and stores to the page bypass the caches and are performed directly to main memory. A read or 

write to a caching-inhibited page affects only the memory element specified by the operation. 




Addresses to which decorated loads are performed should be marked Guarded¹, that is, no
speculative execution is allowed for those instructions. If guarded is not set, then speculative
execution of, for example, a load operation triggers the data update. This is not problematic if
the speculation turns out to be correct. However, if it is not, the load is thrown out of the core
pipeline, but the decoration is still executed in the memory subsystem. This in turn results in an
incorrect value of the data.

Variables (that is, accumulators) affected by decorated operations should be naturally aligned to their 

variable size (for example, word should be 4-byte-aligned). An error here can result in incorrect data 

changes, both to the variable operated on and adjacent data. 
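A cheap way to catch alignment mistakes early is to check the address before the first decorated access, for example in a debug build. The helper below is a minimal sketch; the name `is_naturally_aligned` is an assumption and not part of the macro library in Section 6.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the address p is a multiple of the accumulator size
 * (1, 2, 4 or 8 bytes), that is, when the accumulator is naturally aligned. */
static bool is_naturally_aligned(const volatile void *p, size_t size)
{
    return size != 0 && ((uintptr_t)p % size) == 0;
}
```

A 64-bit accumulator would be checked with `is_naturally_aligned(ptr, 8)` and a 32-bit one with `is_naturally_aligned(ptr, 4)`.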

Decorated load, store, and notify operations behave the same as normal load and store operations in all 

other aspects, such as Access control, Debug event, Storage attributes, and Alignment and memory access 

ordering. In other words, there is no difference between decorated operations and normal operations when 

it comes to application usage. Any application can use them without any OS kernel or Hypervisor 

interaction or permission. 

4.1 Load—Memory Loaded to Core Register with Decoration Result 

For decorated load operations, the processor performs a load operation with the specified decoration to the 

given address and places the data provided by the device in the target register. The different operations are 

as follows: 

8-/16-/32-/64-bit Clear 

8-/16-/32-/64-bit Set 

8-/16-/32-/64-bit Decrement 

8-/16-/32-/64-bit Increment 

4.2 Store—Core Register Stored in Memory with Result from 

Decoration 

For decorated store operations, the processor performs a store operation with the specified decoration to 

the given address and provides the data specified in the source register to the device. The different 

operations are as follows: 

32-/64-bit accumulate 

32-/64-bit increment and 32/64-bit accumulate 

64-bit maximum threshold with unsigned double word 

32-bit maximum threshold with unsigned word 

64-bit minimum threshold with unsigned double word 

32-bit minimum threshold with unsigned word 

1. Guarded: All loads and stores to this page are performed without speculation. That is, they are known to be required. 






Note that the increment and accumulate decoration performs two operations but takes only one
decorated value and only one effective address. The first operation is an increment by 1 and
therefore does not need a value; the value is instead used for the accumulate operation. The
effective address points to a struct in which the first 32-/64-bit value is used for the increment
and the following 32-/64-bit value is used for the accumulation, as shown below. A typical use is
to update the statistics of a dataflow, number of packets and number of bytes, with just one
operation.

struct stat32_pair_t {
    int32_t inc;
    int32_t acc;
};
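The effect of one increment-and-accumulate operation on such a pair can be modelled in plain C. The sketch below only describes the resulting memory contents; it is not atomic and does not use the decorated instructions, and the names `stat32_pair_sketch` and `model_inc_acc_32` are hypothetical.

```c
#include <stdint.h>

/* Same layout as the pair above: packet counter first, byte counter second. */
struct stat32_pair_sketch {
    int32_t inc;   /* incremented by 1 on every operation */
    int32_t acc;   /* accumulates the value supplied with the operation */
};

/* Software model of what a single increment-and-accumulate decoration
 * does to the pair. On the device both updates are performed by the
 * memory subsystem as one operation. */
static void model_inc_acc_32(struct stat32_pair_sketch *s, int32_t value)
{
    s->inc += 1;
    s->acc += value;
}
```

For a flow that carries a 64-byte packet followed by a 1500-byte packet, two calls leave `inc` at 2 and `acc` at 1564.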

4.3 Notify—Decoration Performed on Data in Memory 

A notify instruction is an NOP (No Operation) instruction that does not have any effect on general-purpose 

registers in the core. The different operations are as follows: 

32-/64-bit increment 

32-/64-bit clear 

5 Sample Application 

#include 

#include 

#include 

#include 

__PERCPU uint32_t atb_start, atb_stop; 

__PERCPU uint32_t atb_oh; 

/** Master LWE core does required initialization first */ 

volatile uint32_t g_ctrl_lwe = INV_LWE_ID; 

__PERCPU uint32_t curr_lwe_id = 0; /**< LWE ID for each core */ 

uint32_t sync_lock; 

uint32_t init_lock; 

struct lwe_barrier sync_barrier; 




volatile int32_t *decorated_counter = NULL; 

volatile int32_t lock_counter = 0; 

__PERCPU volatile int32_t private_counter = 0; 

void singlecore_test(void) 

{ 

uint32_t i; 



atb_oh = atb_stop - atb_start; 

APP_INFO ("Start/Stop overhead is %d cycles.", atb_oh); 

atb_oh=4; 




APP_INFO ("1 Decoration took %d cycles.", atb_stop-atb_start - atb_oh); 




} 


APP_INFO ("10 Decorations took %d cycles.", atb_stop-atb_start - atb_oh); 




lock_counter+=i; 


} 


APP_INFO ("1 lock counter took %d cycles.", atb_stop-atb_start - atb_oh); 




} 






} 



void multicore_test(void) 

{ 

} 

uint32_t i; 

if (unlikely(barrier_sync(&sync_barrier) < 0)) 

LWE_PANIC("barrier sync failed!"); 




APP_INFO ("1 Decoration took %d cycles.", atb_stop-atb_start - atb_oh); 






} 








int main(int argc, char *argv[])
{
    uint32_t i;

    curr_lwe_id = get_lwe_id();

    spin_lock(&init_lock);
    if (g_ctrl_lwe == INV_LWE_ID){
        /* The first core to arrive becomes the master and initializes */
        g_ctrl_lwe = curr_lwe_id;
        APP_INFO("*********************************************");
        APP_INFO("Decorated Operations Benchmark, July 2010");
        APP_INFO("Jonas Svennebring, Freescale Nordic");
        APP_INFO ("Partition %d\n\n", curr_lwe_id);

        i = barrier_init(&sync_barrier, get_online_core_mask());
        if (unlikely(i != 0)) {
            APP_ERROR("Barrier initialization failed");
            return 1;
        }

        decorated_counter = (int32_t *) stats_memalign(CACHE_LINE_SIZE, sizeof(int32_t));
        *decorated_counter = 0;

        APP_INFO("");
        APP_INFO("Singlecore Test:");
        singlecore_test();
    }
    else{
        APP_INFO("Slave Partition, id %d", curr_lwe_id);
        atb_start = mfspr(SPR_ATBL); //start timer
        atb_stop = mfspr(SPR_ATBL); //stop timer
        atb_oh = atb_stop - atb_start;
        APP_INFO ("Start/Stop overhead is %d cycles.", atb_oh);
    }
    spin_unlock(&init_lock);

    APP_INFO("");
    APP_INFO("Multicore Test:");
    multicore_test();
    APP_INFO("DONE!");
    return 0;
}

6 Decorated Macro Functions 

////////////////////////////////// 

//// Load Definitions 

/////////////////////////// 

enum LOAD_DECORATION {
    LOAD_DECORATION_CLEAR = 0,
    LOAD_DECORATION_SET = 1,
    LOAD_DECORATION_DEC = 2,
    LOAD_DECORATION_INC = 3
};

/* Load the byte at a and clear it in memory */
static inline uint8_t decorated_load_clear_8(volatile void *a){
    uint8_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_CLEAR;
    __ASM("lbdx %0, %1, %2"
        : "=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}
/* Load the byte at a and set it in memory */
static inline uint8_t decorated_load_set_8(volatile void *a){
    uint8_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_SET;
    __ASM("lbdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

/* Load the byte at a and decrement it in memory */
static inline uint8_t decorated_load_dec_8(volatile void *a){
    uint8_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_DEC;
    __ASM("lbdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

/* Load the byte at a and increment it in memory */
static inline uint8_t decorated_load_inc_8(volatile void *a){
    uint8_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_INC;
    __ASM("lbdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}


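static inline uint16_t decorated_load_clear_16(volatile void *a){
    uint16_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_CLEAR;
    __ASM("lhdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint16_t decorated_load_set_16(volatile void *a){
    uint16_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_SET;
    __ASM("lhdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint16_t decorated_load_dec_16(volatile void *a){
    uint16_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_DEC;
    __ASM("lhdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint16_t decorated_load_inc_16(volatile void *a){
    uint16_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_INC;
    __ASM("lhdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}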


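static inline uint32_t decorated_load_clear_32(volatile void *a){
    uint32_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_CLEAR;
    __ASM("lwdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint32_t decorated_load_set_32(volatile void *a){
    uint32_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_SET;
    __ASM("lwdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint32_t decorated_load_dec_32(volatile void *a){
    uint32_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_DEC;
    __ASM("lwdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint32_t decorated_load_inc_32(volatile void *a){
    uint32_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_INC;
    __ASM("lwdx %0, %1, %2":"=r"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}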


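static inline uint64_t decorated_load_clear_64(volatile void *a){
    uint64_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_CLEAR;
    __ASM("lfddx %0, %1, %2":"=f"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint64_t decorated_load_set_64(volatile void *a){
    uint64_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_SET;
    __ASM("lfddx %0, %1, %2":"=f"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint64_t decorated_load_dec_64(volatile void *a){
    uint64_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_DEC;
    __ASM("lfddx %0, %1, %2":"=f"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}

static inline uint64_t decorated_load_inc_64(volatile void *a){
    uint64_t r;
    enum LOAD_DECORATION d = LOAD_DECORATION_INC;
    __ASM("lfddx %0, %1, %2":"=f"(r)
        : "r"(d), "r"(a)
        : "memory");
    return r;
}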

////////////////////////////////// 

//// Store Definitions 

/////////////////////////// 






enum STORE_DECORATION {
    STORE_DECORATION_ACC_64 = 0,
    STORE_DECORATION_ACC_32 = 1,
    STORE_DECORATION_INC_ACC_64 = 2,
    STORE_DECORATION_INC_ACC_32 = 3
};


struct stat32_pair_t {
    int32_t inc;
    int32_t acc;
};

struct stat64_pair_t {
    int64_t inc;
    int64_t acc;
};

/* Add v to the 32-bit accumulator at a */
static inline void decorated_store_32_acc_32(volatile void *a, register int32_t v){
    volatile void *address = a;
    enum STORE_DECORATION d = STORE_DECORATION_ACC_32;
    __ASM("stwdx %0, %1, %2":
        :"r"(v), "r"(d), "r"(address)
        :"memory");
}

/* Increment the first 32-bit value of the pair at a and add v to the second */
static inline void decorated_store_32_inc_acc_32(volatile void *a, register int32_t v){
    volatile void *address = (void *) ((uintptr_t) a + 4);
    enum STORE_DECORATION d = STORE_DECORATION_INC_ACC_32;
    __ASM("stwdx %0, %1, %2":
        :"r"(v), "r"(d), "r"(address)
        :"memory");
}






} 


__ASM("stwdx %0, %1, %2": 


:"memory"); 




} 



__ASM("stwdx %0, %1, %2": 


:"memory"); 


} 

volatile void *address = a; 


__ASM("stfddx %0, %1, %2": 

:"f"(v), "r"(d), "r"(address) 

:"memory"); 


} 



__ASM("stfddx %0, %1, %2": 

:"f"(v), "r"(d), "r"(address) 

:"memory"); 

////////////////////////////////// 

//// Notify Definitions 

/////////////////////////// 

enum NOTIFY_DECORATION {
    NOTIFY_DECORATION_INC_64 = 0,
    NOTIFY_DECORATION_INC_32 = 1,
    NOTIFY_DECORATION_CLEAR_64 = 2,
    NOTIFY_DECORATION_CLEAR_32 = 3
};

/* Increment the 32-bit value at a */
static inline void decorated_notify_inc_32(volatile void *a){
    register enum NOTIFY_DECORATION d = NOTIFY_DECORATION_INC_32;
    __ASM("dsn %0, %1":
        :"r"(d), "r"(a)
        :"memory");
}

/* Clear the 32-bit value at a */
static inline void decorated_notify_clear_32(volatile void *a){
    register enum NOTIFY_DECORATION d = NOTIFY_DECORATION_CLEAR_32;
    __ASM("dsn %0, %1":
        :"r"(d), "r"(a)
        :"memory");
}

/* Increment the 64-bit value at a */
static inline void decorated_notify_inc_64(volatile void *a){
    register enum NOTIFY_DECORATION d = NOTIFY_DECORATION_INC_64;
    __ASM("dsn %0, %1":
        :"r"(d), "r"(a)
        :"memory");
}

/* Clear the 64-bit value at a */
static inline void decorated_notify_clear_64(volatile void *a){
    register enum NOTIFY_DECORATION d = NOTIFY_DECORATION_CLEAR_64;
    __ASM("dsn %0, %1":
        :"r"(d), "r"(a)
        :"memory");
}




7 Summary 



Working on shared data in multicore devices poses a problem, because simultaneous access to the data 

without any protection gives rise to race conditions and nondeterministic behavior. The traditional approach

to avoid race-conditions between the cores has been to introduce locks around the shared data. However, 

locks decrease the level of parallelism (and therefore the scalability of the software) as well as raise new 

issues with both reduced performance and robustness as side effects. 

The solution described in this application note makes use of new instructions that allow a central part of 

the device to update the data. This can then be done in an atomic fashion and without core-specific 

influence. Performance can be as good as private data accesses, and performance for both synthetic 

worst-case tests as well as application-realistic tests is well above that of lock-based solutions.

8 References 

Following is a list of helpful references used in this application note: 

1. Embedded Multicore: An Introduction by Jonas Svennebring, John Logan, Jakob Engblom, Patrik 

Strömblad. Freescale Semiconductor, Inc. 2009. 

2. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities by 

Amdahl, Gene. AFIPS Conference Proceedings (30) 483–485 (1967). 

3. Experience with Processes and Monitors in Mesa by Butler W. Lampson and David D. Redell. 

CACM 23(2):105-117 (February 1980) 

4. The Deadlock problem: a classifying bibliography by Zöbel, Dieter. ACM SIGOPS Operating 

Systems Review 17 (4): 6–15. (October 1983) 

5. Eliminating receive livelock in an interrupt-driven kernel by Mogul, Jeffrey C.; K. K. 

Ramakrishnan. ACM TOCS 15 (3): 217-252 (August 1997) 

6. Freescale Book E Implementation Standards for Storage, version 0.92, 3/7/2008 

7. Transactional Memory: Architectural Support for Lock-Free Data Structures by Maurice Herlihy, 

J. Eliot B. Moss. ISCA Proceedings, 289–300 (1993). 

9 Revision History 

Table 1 provides a revision history for this application note. 

Table 1. Document Revision History

Rev. Number | Date    | Substantive Change(s)
A           | 07/2010 | Initial NDA release




Information in this document is provided solely to enable system and software 

implementers to use Freescale Semiconductor products. There are no express or 

implied copyright licenses granted hereunder to design or fabricate any integrated 

circuits or integrated circuits based on the information in this document. 

Freescale Semiconductor reserves the right to make changes without further notice to 

any products herein. Freescale Semiconductor makes no warranty, representation or 

guarantee regarding the suitability of its products for any particular purpose, nor does 

Freescale Semiconductor assume any liability arising out of the application or use of 

any product or circuit, and specifically disclaims any and all liability, including without 

limitation consequential or incidental damages. “Typical” parameters which may be 

provided in Freescale Semiconductor data sheets and/or specifications can and do 

vary in different applications and actual performance may vary over time. All operating 

parameters, including “Typicals” must be validated for each customer application by 

customer’s technical experts. Freescale Semiconductor does not convey any license 

under its patent rights nor the rights of others. Freescale Semiconductor products are 

not designed, intended, or authorized for use as components in systems intended for 

surgical implant into the body, or other applications intended to support or sustain life, 

or for any other application in which the failure of the Freescale Semiconductor product 

could create a situation where personal injury or death may occur. Should Buyer 

purchase or use Freescale Semiconductor products for any such unintended or 

unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor 

and its officers, employees, subsidiaries, affiliates, and distributors harmless against all 

claims, costs, damages, and expenses, and reasonable attorney fees arising out of, 

directly or indirectly, any claim of personal injury or death associated with such 

unintended or unauthorized use, even if such claim alleges that Freescale 

Semiconductor was negligent regarding the design or manufacture of the part. 

Freescale and the Freescale logo are trademarks of Freescale 

Semiconductor, Inc. Reg. U.S. Pat. & Tm. Off. CoreNet and QorIQ are 

trademarks of Freescale Semiconductor, Inc. All other product or service 

names are the property of their respective owners. The Power Architecture 

and Power.org word marks and the Power and Power.org logos and related 

marks are trademarks and service marks licensed by Power.org. 

© 2010 Freescale Semiconductor, Inc.
