Decorated Operations for QorIQ P3/P4/P5 Processors - Freescale ...


Decorated Operations for QorIQ P3/P4/P5 Processors, Rev. A
Freescale Semiconductor. Freescale Confidential Proprietary. Preliminary - Subject to Change Without Notice

Statistics acceleration with the use of decorated operations solves the initial problem without the need for locks. This drastically increases performance and lowers the software complexity, because all protection against lock-related issues can be removed.

1 Introduction

Currently, the industry is rapidly migrating to multicore solutions, largely because the easy single-core performance gains from stronger or faster cores are coming to an end. Further improvements to code execution (instructions per cycle) require drastically more complex logic. Increasing the frequency is difficult, because power consumption increases superlinearly with the frequency (the relation is P = CV²F, but higher frequency also requires higher voltage and leakier processes). Furthermore, higher frequency gives little additional performance because of the core/memory speed difference [1].

Multicore solutions give a theoretically higher performance with low aggregate power consumption, but it is crucial that the hardware and software are designed to allow for efficient scaling. The most important hardware aspects are buses and memory interfaces; in this area, switch fabrics are replacing traditional buses. For example, Freescale’s P4080 communication processor, equipped with eight e500mc Power Architecture® cores, addresses this problem with the CoreNet coherency fabric, which provides nearly 1 Tbps of internal memory bandwidth, and with dual DDR3 interfaces.

The other aspect, software design, is difficult to solve in a general way that allows efficient scaling. Amdahl’s law [2] describes the application speed-up as a function of the number of cores and of how well parallelized the software is. As shown in Figure 1, software that is largely sequential can never make efficient use of highly parallel architectures. Therefore, it is critical to provide means to remove sequential sections.

Figure 1. Amdahl’s Law (scaling over multiple cores for different degrees of parallelized code; S marks the portion of sequential code)
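To make the shape of the curves in Figure 1 concrete, Amdahl’s law can be written as speed-up = 1 / (S + (1 - S) / N), where S is the sequential fraction of the run time and N is the number of cores. The C sketch below evaluates that formula. The function names and the sample values of S are illustrative assumptions and are not taken from the figure.

```c
#include <stdio.h>

/* Amdahl's law: speed-up on n cores when a fraction s of the run time
 * is sequential (0 <= s <= 1) and the remainder parallelizes perfectly. */
double amdahl_speedup(double s, unsigned n)
{
    return 1.0 / (s + (1.0 - s) / (double)n);
}

/* Print the speed-up for a few core counts. The values of s below are
 * illustrative assumptions, not the curves plotted in Figure 1. */
void print_amdahl_table(void)
{
    const double s_values[] = { 0.05, 0.25, 0.50 };
    const unsigned cores[] = { 1, 2, 4, 8 };

    for (size_t i = 0; i < sizeof s_values / sizeof s_values[0]; i++) {
        printf("S = %.2f:", s_values[i]);
        for (size_t j = 0; j < sizeof cores / sizeof cores[0]; j++)
            printf("  N=%u -> %.2fx", cores[j],
                   amdahl_speedup(s_values[i], cores[j]));
        printf("\n");
    }
}
```

Because the term S never shrinks as N grows, the speed-up is bounded by 1 / S. Code that is half sequential, for example, can never run more than twice as fast, no matter how many cores are added.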
Single-core applications typically work by reading data from a data structure, computing a result, and then writing that result back to the data structure. When such an application is parallelized, the same computation is done, but the data structure must now be protected so that concurrent accesses do not update it incorrectly.

Take bank software as an example. Mr. Foo’s account currently holds $1000 and is accessed by two processes at nearly the same time. One deposits $500 and the other withdraws $20. Both processes read the current balance, add or remove money, and finally write the result back to the account. If both reads take place before either write, the final result is whatever the last write stored, and the other operation is lost. Mr. Foo will be left with either $1500 or only $980 in his account, instead of the correct $1480. The final value of the account is non-deterministic.

Figure 2. Mr. Foo’s Bank Account

This type of issue is called a race condition, and its traditional solution is to introduce a lock on the data structure. However, locks are sequential by nature and lower the degree of parallelism. A lock is typically implemented as follows:

    Test/Set to get lock for structure
    If (lock was already used by other) then try above again
    Read and update structure
    Release lock

Locks also introduce additional software complexity, because priority inversion [3] and dead-lock situations [4] can occur and must be handled. Handling them can in turn lead to live-locks [5]. To conclude, the traditional method of handling synchronization by using locks is not a robust solution, because it does not allow for efficient scaling. It is also very costly in terms of cycles.
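The bank account race and the lock that prevents it can be sketched in C. This is a minimal illustration under stated assumptions, not code from this application note: the `account` structure and the function names are hypothetical, the lock uses the standard C11 `atomic_flag` test-and-set rather than any Freescale library, and `lost_update` replays the unlucky interleaving in a single thread so that the outcome is reproducible.

```c
#include <stdatomic.h>

/* Hypothetical account record; the names are illustrative only. */
struct account {
    atomic_flag lock;   /* clear = free, set = held */
    long balance;
};

/* Replays the bad interleaving without threads: both "processes" read
 * the balance before either one writes its result back. */
long lost_update(long start, long delta_first, long delta_last)
{
    long balance = start;
    long read_first = balance;            /* first process reads      */
    long read_last = balance;             /* last process reads       */
    balance = read_first + delta_first;   /* first write ...          */
    balance = read_last + delta_last;     /* ... is overwritten here  */
    return balance;
}

/* Lock-protected update: acquire, read and update, release. */
void account_update(struct account *a, long delta)
{
    /* Test/Set: spin until the flag was clear before we set it. */
    while (atomic_flag_test_and_set_explicit(&a->lock,
                                             memory_order_acquire))
        ;                                 /* lock held by other: retry */
    a->balance += delta;                  /* read and update structure */
    atomic_flag_clear_explicit(&a->lock, memory_order_release);
}
```

With a starting balance of 1000, `lost_update(1000, 500, -20)` yields 980 and `lost_update(1000, -20, 500)` yields 1500, the two wrong outcomes from the example. An account initialized as `struct account a = { ATOMIC_FLAG_INIT, 1000 };` and updated through `account_update` with +500 and -20 ends at 1480 regardless of order, at the price of serializing every update behind the spin loop.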
In a benchmark running bare-board on the P4080 and utilizing Freescale’s light-weight executive (LWE) library, a shared variable protected by locks took nearly 25 times as long to update as a private variable, due to the lock overhead alone. The private variable was declared with “volatile” to ensure that no unfair compiler optimizations were used and that a full read-update-write cycle was executed.
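The two update paths being compared might look roughly like the following C sketch. It is an assumption-laden illustration rather than the benchmark itself: it does not use the LWE library, the lock is a plain C11 `atomic_flag` spin lock, all names are hypothetical, and the standard `clock()` call stands in for whatever cycle counter the target provides. It is not expected to reproduce the factor of 25 on other hardware.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* volatile forces a real load, add, and store on every update, so the
 * compiler cannot collapse the loop into a single addition. */
static volatile uint32_t private_counter;

static atomic_flag shared_lock = ATOMIC_FLAG_INIT;
static volatile uint32_t shared_counter;

/* Path 1: private variable, no protection needed. */
void update_private(void)
{
    private_counter = private_counter + 1;
}

/* Path 2: shared variable, every update wrapped in lock and unlock. */
void update_shared_locked(void)
{
    while (atomic_flag_test_and_set_explicit(&shared_lock,
                                             memory_order_acquire))
        ;                                 /* spin until the lock is free */
    shared_counter = shared_counter + 1;
    atomic_flag_clear_explicit(&shared_lock, memory_order_release);
}

/* Runs fn n times and returns the elapsed processor time in clock()
 * ticks; a hardware cycle counter would give finer resolution. */
clock_t time_updates(void (*fn)(void), uint32_t n)
{
    clock_t start = clock();
    for (uint32_t i = 0; i < n; i++)
        fn();
    return clock() - start;
}
```

Calling `time_updates(update_private, n)` and `time_updates(update_shared_locked, n)` with the same `n` and dividing the two results gives the relative cost of the lock on a given machine. Even with no contention, the locked path pays for two extra atomic operations per update.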
