Parallel Computing On Qualcomm Platforms Using OpenCL - Uplinq
<strong>Parallel</strong> <strong>Computing</strong> on<br />
<strong>Qualcomm</strong><br />
® <strong>Platforms</strong><br />
<strong>Using</strong> <strong>OpenCL</strong> <br />
Alex Bourd, Senior Staff Engineer & Manager, <strong>Qualcomm</strong><br />
<strong>Qualcomm</strong> Proprietary<br />
<strong>OpenCL</strong> is a registered trademark of Apple Inc.
How can I harness all this power?<br />
• Traditional parallel processing<br />
leverages similar, closely coupled<br />
processors<br />
• Mobile devices have many built-in<br />
processors, each with unique<br />
capabilities<br />
• Challenge: Capture and coordinate<br />
the power of disparate processors<br />
to solve a problem.<br />
• Harnessing this power opens up the<br />
opportunity for many applications,<br />
not previously possible on mobile<br />
devices<br />
<strong>OpenCL</strong> – The Missing Piece<br />
• “<strong>OpenCL</strong> (Open <strong>Computing</strong> Language)<br />
is the first open, royalty-free standard<br />
for general-purpose parallel<br />
programming of heterogeneous<br />
systems.”<br />
– Source: www.khronos.org<br />
• <strong>OpenCL</strong> provides a single<br />
programming API, single memory<br />
management model, and handles<br />
synchronization of data for different<br />
compute processors.<br />
• <strong>OpenCL</strong> makes applications a reality<br />
Interaction with the real world<br />
Augmented Reality<br />
Object and Facial Recognition<br />
Gaming<br />
Fashion<br />
Landmarks<br />
Traffic scanning<br />
Navigation<br />
Sue<br />
Face tracking<br />
Low Light Image Enhancement Effort<br />
[Chart: Speed-up vs Effort for <strong>OpenCL</strong>. X-axis: Effort (Man-weeks), 1 to 6. Y-axis: % Speed-up over C2D, 0 to 100.]<br />
• Wavelet-based image de-noising algorithm<br />
• Speed-up shown is for a GTX260 GPU vs. a Core2Duo CPU<br />
• Optimizations include global memory coalescing, tile processing, loop unrolling and NDRange optimizations<br />
Core2Duo is a registered trademark of Intel Inc.<br />
3D Processing: Depth Cue Enhancement<br />
Heterogeneous Compute Devices on Modern<br />
Mobile <strong>Platforms</strong><br />
• A modern mobile processor contains multiple general purpose compute<br />
devices<br />
– Multiple CPUs with vector units<br />
• 2 <strong>Qualcomm</strong> Enhanced ARM CPUs<br />
@ 1.5GHz = 24 GFLOPS<br />
• Latency sensitive, control oriented<br />
– DSP<br />
• QDSP6 Multimedia DSP 2.4 GFLOPS<br />
• Fixed-point signal processing<br />
– GPGPU<br />
• Adreno @ 350MHz = 89.6 GFLOPS<br />
• Latency tolerant, stream oriented<br />
• These compute devices must work well together to realize the computing<br />
potential of the system<br />
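The peak figures above can be sanity-checked with simple arithmetic. The sketch below assumes that peak GFLOPS equals the number of units times the clock in GHz times the FLOPs completed per cycle, and works backwards from the slide's numbers. The helper name `flops_per_cycle` is made up for this illustration, and the per-cycle values it derives are an inference, not a published specification.

```python
def flops_per_cycle(peak_gflops, clock_ghz, units=1):
    """FLOPs each unit must complete per clock cycle to reach the quoted peak.

    Assumes peak_gflops = units * clock_ghz * flops_per_cycle.
    """
    return peak_gflops / (clock_ghz * units)


# Figures quoted on the slide.
cpu = flops_per_cycle(24.0, 1.5, units=2)  # two CPUs at 1.5 GHz
gpu = flops_per_cycle(89.6, 0.35)          # Adreno at 350 MHz

print(f"CPU: about {cpu:.0f} FLOPs per cycle per core")
print(f"GPU: about {gpu:.0f} FLOPs per cycle in total")
```

Under that assumption the CPU figure corresponds to about 8 FLOPs per cycle per core and the GPU figure to about 256 FLOPs per cycle across the whole device.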
ARM is a registered trademark of ARM Inc.<br />
General Purpose vs. Custom Devices<br />
• An Embedded <strong>OpenCL</strong> Platform has one or more <strong>OpenCL</strong> devices<br />
– Multicore CPU, GPGPU, DSP<br />
• These devices support programming via the <strong>OpenCL</strong> C API<br />
However:<br />
• Power is the absolute limiter in mobile:<br />
– Joule’s law dominates Moore’s law – performance is limited by power usage<br />
• Complex data formats such as H.264 are better suited to dedicated HW<br />
• <strong>Qualcomm</strong> also incorporates additional Embedded Custom Devices for improved<br />
power and efficiency<br />
– Image effects processor<br />
– Image CODECs<br />
– Video CODECs<br />
– Audio CODECs<br />
– GPS device<br />
Integrating Programmable and Custom Devices<br />
• Custom Device compute resources do not support <strong>OpenCL</strong> C<br />
– They might not be programmable<br />
– Their Instruction Set Architecture is insufficient for implementing general purpose APIs<br />
• Custom Devices need to become a part of <strong>OpenCL</strong> runtime<br />
– So that the application can use <strong>OpenCL</strong> scheduling mechanisms to distribute its<br />
workload<br />
– <strong>OpenCL</strong> memory mechanisms can be used to share data.<br />
• <strong>Qualcomm</strong> has proposed a Custom Device Extension for Khronos <strong>OpenCL</strong> to<br />
allow:<br />
– Standardized device-to-device synchronization with minimal CPU involvement<br />
• Including Custom and <strong>OpenCL</strong> C devices<br />
– Specialized HW devices to execute custom kernels using the host <strong>OpenCL</strong> kernel API<br />
• A kernel may accept 0 or more config parameters to define run-time behavior of the device<br />
– To enable proprietary firmware kernels<br />
• Proprietary kernels may be implemented for <strong>OpenCL</strong><br />
<strong>Qualcomm</strong> <strong>OpenCL</strong> Plans<br />
• <strong>Qualcomm</strong> plans to support <strong>OpenCL</strong> for our customers to both:<br />
– Take advantage of high performance processing units on our chipsets<br />
– Leverage the performance and power efficiency of key fixed function<br />
processors<br />
• Our customers can leverage <strong>Qualcomm</strong> parallel compute technology to:<br />
– Provide applications, not possible before, for a large volume of handheld<br />
mobile devices<br />
– Bring into play previously dedicated or difficult to access compute devices<br />
• <strong>Qualcomm</strong> Enhanced NEON, QDSP, Adreno Graphics compute processors<br />
– Efficiently share data processing between compute devices<br />
– Implement value added video and imaging functionality<br />
• CODECs, Effects, Image recognition<br />
– Take advantage of mobile power efficiency for complex applications<br />
NEON is a trademark of ARM Inc.<br />
Current <strong>Qualcomm</strong> Platform<br />
• Current solution supports individual processing units sharing data through format<br />
conversions and memory copies<br />
• CPU does synchronization and control<br />
[Diagram: software labels (C++, Java, etc; OpenGL ES 2.0; QCT ONLY; DirectShow ®; OpenMAX) sit above the MSM. Inside the MSM, the Multi-core CPU Processor, GPGPU Processor, QDSP Processor, Video CODEC Processor, and Image Processor (VFE) each have their own memory: CPU Memory, GPU Memory, DSP Memory, Video CODEC Memory, and Image Memory.]<br />
• Data exchanged through format conversions (stride, pixel depth, color formats) and memory copies<br />
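A stride conversion is one concrete instance of the format conversions mentioned above: when two processing units expect different row strides, every row has to be copied into a new buffer. The sketch below is a plain Python illustration only. The function `repack` and the buffer contents are hypothetical and do not represent any Qualcomm API.

```python
def repack(src, width, height, src_stride, dst_stride):
    """Copy `height` rows of `width` bytes from one row stride to another."""
    if width > src_stride or width > dst_stride:
        raise ValueError("a row must fit inside both strides")
    dst = bytearray(dst_stride * height)  # padding bytes, if any, stay zero
    for row in range(height):
        s = row * src_stride
        d = row * dst_stride
        dst[d:d + width] = src[s:s + width]  # one extra memory copy per row
    return bytes(dst)


# Two rows of three bytes, padded to a stride of four bytes per row.
src = bytes([1, 2, 3, 0,
             4, 5, 6, 0])
packed = repack(src, width=3, height=2, src_stride=4, dst_stride=3)
print(list(packed))
```

Every such hand-off costs a full pass over the image, which is the overhead a consistent data layout between devices is meant to avoid.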
Java is a trademark of Sun Microsystems Inc. OpenGL is a registered trademark of Silicon Graphics Inc. OpenMAX is a registered trademark of the Khronos Group Inc. DirectShow is a registered trademark of the Microsoft Corporation<br />
<strong>Qualcomm</strong>’s <strong>OpenCL</strong> Platform<br />
<strong>OpenCL</strong> Application: computation, synchronization, process control, data sharing for heterogeneous processing environment<br />
• Multi-core CPU Processor: 1-4 vector processors, 1-2 GHz, 8-64 GFLOPS<br />
• GPGPU Graphics Processor (Adreno ® Graphics): 32-128 ALUs (shader processors), 200-400 MHz, 13-102 GFLOPS<br />
• 2D/Vector GPU: vector rendering, compositing<br />
• Video CODEC Processor: HD video encoding/decoding<br />
• QDSP Processor: signal processing, 0.6-2.4 GFLOPS<br />
• Image Processor (VFE): pixel format conversion, image compression, image enhancement, scale & rotate<br />
<strong>OpenCL</strong> programmable API plus:<br />
• Fixed function subsystems for performance and power efficiency<br />
• Consistent data format and data memory layout between fixed and parallel compute subsystems<br />
• Consistent synchronization methods without CPU intervention<br />
• Consistent process control<br />
[Diagram: GPGPU block diagram with four shader processors (SP1 to SP4). Each contains 32-bit and 16-bit FPUs, an Instruction / Constant L1, a HW Multi-threaded Scheduler, a Unified General Purpose Register File, a Special Function unit, and Shared Memory.]<br />
Load Balancing<br />
Multiple device workload distribution<br />
Video conference game example: <strong>OpenCL</strong> load balancing for a web based game application with video conferencing<br />
• GPGPU Graphics Processor (Adreno Graphics): low light enhancement, game rendering<br />
• GPGPU OpenVG Vector Graphics Processor: font rendering, surface compositing<br />
• Multi-core CPU Processors: game physics, web page layout, game scene graph parsing<br />
• VFE Image Processor: Bayer filtering, video format conversion<br />
• Video CODEC Processor: incoming video decode<br />
• QDSP Signal Processor: background noise filter<br />
OpenVG is a trademark of the Khronos Group Inc.<br />
View of Next Generation <strong>OpenCL</strong><br />
[Diagram: a Host (<strong>Qualcomm</strong> Enhanced ARM CPU) is shown with a Queue holding Task1 to Task3, Send1 and Send2, and Wait1 and Wait2. A Communication Fabric connects the compute units CU – 3D (SIMD), CU – QDSP/ARM, and HW – Video, each shown with Semaphores.]<br />
<strong>OpenCL</strong> System Overview<br />
How it works<br />
• CPUs: multiple cores driving performance increases<br />
• GPGPUs: increasingly general purpose data-parallel computing; improving numerical precision<br />
• Emerging intersection: <strong>OpenCL</strong> Heterogeneous <strong>Computing</strong>, between multi-processor programming (e.g. OpenMP) and graphics APIs and shading languages<br />
OpenMP is a trademark of the OpenMP Architecture Review Board<br />
<strong>OpenCL</strong> – Open <strong>Computing</strong> Language<br />
Open, royalty-free standard for portable, parallel programming of heterogeneous<br />
parallel computing CPUs, GPUs, and other processors<br />
Source: www.khronos.org<br />
<strong>OpenCL</strong> Still Requires Smart Developers to<br />
Leverage Compute Devices Effectively<br />
CPUs:<br />
• Are good at complex logic/branching<br />
• Have several cores<br />
• Are good at very long programs<br />
• Don’t have image filtering hardware<br />
• Don’t have local memory<br />
• Have a hierarchical memory cache<br />
• Are good for random data<br />
GPUs:<br />
• Are bad at complex logic/branching<br />
• Have hundreds of cores<br />
• Are bad at long programs<br />
• Have efficient image filtering hardware<br />
• Have local on-chip memory<br />
• Have a simple memory cache or none<br />
• Are good for streaming data<br />
• Many practical algorithms require some of both – simple highly parallel steps and<br />
steps that require very long programs with complex logic and branching<br />
• It would be natural to implement such algorithms on a combination of CPU and<br />
GPGPU, each doing parts at which it is really good<br />
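The division of labor described above can be sketched as a simple rule that sends simple, highly parallel steps to the GPU and everything else to the CPU. The step names and the two traits used below are hypothetical, chosen only to mirror the CPU and GPU lists on this slide. A real application would make this choice based on measurement.

```python
def pick_device(step):
    """Return 'gpu' for simple, highly parallel steps and 'cpu' otherwise."""
    if step["data_parallel"] and not step["branch_heavy"]:
        return "gpu"
    return "cpu"


# Hypothetical pipeline mixing both kinds of step.
pipeline = [
    {"name": "pixel filter", "data_parallel": True, "branch_heavy": False},
    {"name": "feature matching", "data_parallel": False, "branch_heavy": True},
    {"name": "histogram", "data_parallel": True, "branch_heavy": False},
]

plan = {step["name"]: pick_device(step) for step in pipeline}
print(plan)
```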
However <strong>OpenCL</strong> Provides for Consistent APIs<br />
• Need to unify the concept of a “computing circuit”<br />
– CPU and GPU cores are apples and oranges, need some common concept<br />
– Came up with a common concept of an <strong>OpenCL</strong> Device, though still kept<br />
three distinct device classes based on market reality:<br />
• CL_DEVICE_TYPE_CPU (<strong>Qualcomm</strong> Enhanced ARM CPU, including Enhanced NEON)<br />
• CL_DEVICE_TYPE_ACCELERATOR (QDSP)<br />
• CL_DEVICE_TYPE_GPU (Adreno Graphics)<br />
• Emerging “Fixed Functionality Device”<br />
– such as video decoding hardware, camera demosaic hardware<br />
• Need to unify memory classes<br />
– Global memory<br />
– On-chip memory caches<br />
– On-chip shared memory<br />
– Register files<br />
Anatomy of an <strong>OpenCL</strong> Platform<br />
• A host computer can have one or more <strong>OpenCL</strong> devices<br />
• Each <strong>OpenCL</strong> device has one or more compute units<br />
• Each compute unit has one or more processing elements<br />
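The hierarchy in the bullets above can be pictured as a small nested data structure. This is a plain Python model, not the OpenCL API. The class names and the device and unit counts are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeUnit:
    processing_elements: int  # one or more per compute unit


@dataclass
class Device:
    name: str
    compute_units: List[ComputeUnit] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        return sum(cu.processing_elements for cu in self.compute_units)


@dataclass
class Platform:
    devices: List[Device] = field(default_factory=list)


# Hypothetical counts, chosen only to show the three levels.
platform = Platform(devices=[
    Device("cpu", [ComputeUnit(1), ComputeUnit(1)]),
    Device("gpu", [ComputeUnit(8) for _ in range(4)]),
])

for dev in platform.devices:
    print(dev.name, len(dev.compute_units), dev.total_processing_elements())
```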
<strong>OpenCL</strong> devices<br />
<strong>OpenCL</strong> Memory Model<br />
• Memory Types<br />
– Private: on-chip, temp register file<br />
– Local: on-chip, read-write, shared within a compute<br />
unit<br />
– Constant: on-chip memory, read-only, shared within<br />
compute unit<br />
– Global: system memory, shared between all<br />
compute units, can be on host or device<br />
• Memory model<br />
– <strong>OpenCL</strong> provides a barrier mechanism for synchronization<br />
• Memory contents are assumed to be undefined across work items unless explicitly synchronized<br />
– Multiple distinct address spaces<br />
– Address spaces can be collapsed depending on the<br />
device’s memory subsystem<br />
• Diagram shows typical memory layout for a<br />
GPU architecture<br />
[Diagram: a Compute Device contains Compute Units 1 to N. Each compute unit has ALUs #1 to #M, each with its own Private Memory, plus a Local Memory shared within the unit. All compute units reach Global Memory in Compute Device Memory through a Global / Constant Memory Data Cache.]<br />
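The barrier rule in the memory model can be illustrated with ordinary threads: each work item writes its own slot of local memory, waits at a barrier, and only then reads a slot written by another work item. The sketch uses Python's `threading.Barrier` as a stand-in for an OpenCL work-group barrier. The group size and the values written are arbitrary assumptions.

```python
import threading

GROUP_SIZE = 4
local_mem = [None] * GROUP_SIZE  # stands in for one work group's local memory
result = [None] * GROUP_SIZE     # stands in for global memory
barrier = threading.Barrier(GROUP_SIZE)


def work_item(local_id):
    local_mem[local_id] = local_id * 10      # each work item writes its own slot
    barrier.wait()                           # wait until every work item has written
    neighbour = (local_id + 1) % GROUP_SIZE
    result[local_id] = local_mem[neighbour]  # now safe to read another item's slot


threads = [threading.Thread(target=work_item, args=(i,)) for i in range(GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)
```

Without the `barrier.wait()` call a work item could read its neighbour's slot before it was written, which is the undefined case the slide warns about.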
Memory specifics for various devices<br />
• CPU generally doesn’t have physical local memory; local memory can be simulated using global memory<br />
• Fixed functionality devices (such as a video decoder) might only have global memory access<br />
• GPU might have dedicated global memory or have global memory shared with<br />
CPU<br />
• Global memory cache is optional; however, if present, global cache layout is<br />
visible to applications (<strong>OpenCL</strong> query for size and line size)<br />
• Minimum requirements for embedded systems:<br />
– Local memory: 1 KB per compute unit<br />
– Const on-chip memory: 4 KB per compute unit<br />
– Global on-chip cache: Optional<br />
– Global memory: 1 MB<br />
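An application could compare a device's reported sizes against the minimums listed above before choosing a code path. The sketch below hard-codes the slide's figures. The dictionary keys and the example device are hypothetical. In a real program the values would come from OpenCL device queries rather than a literal dictionary.

```python
KB = 1024
MB = 1024 * KB

# Minimums as listed on the slide. The global cache is optional, so it is not checked.
EMBEDDED_MINIMUMS = {
    "local_mem": 1 * KB,
    "constant_mem": 4 * KB,
    "global_mem": 1 * MB,
}


def unmet_minimums(device_limits):
    """Return the names of the memory types that fall below the slide's minimums."""
    return sorted(
        name
        for name, minimum in EMBEDDED_MINIMUMS.items()
        if device_limits.get(name, 0) < minimum
    )


# Hypothetical device with too little constant memory.
hypothetical_device = {"local_mem": 8 * KB, "constant_mem": 2 * KB, "global_mem": 64 * MB}
print(unmet_minimums(hypothetical_device))
```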
<strong>OpenCL</strong> data set decomposition into<br />
work groups & work items<br />
A WORK ITEM is a point in a<br />
data set<br />
WORK GROUPS are made of<br />
adjacent WORK ITEMS; WORK<br />
GROUPS share data through local<br />
memory<br />
One WORK GROUP at<br />
a time per compute unit<br />
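The decomposition can be made concrete for a one-dimensional data set. With no global offset, the OpenCL C built-ins relate as global id = group id × local size + local id. The sketch below enumerates that mapping in plain Python. The function name `decompose` and the sizes are illustrative assumptions, and the divisibility check reflects the OpenCL 1.x requirement that the global size be a multiple of the work-group size.

```python
def decompose(global_size, local_size):
    """List (group_id, local_id, global_id) for every work item in a 1-D range."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    num_groups = global_size // local_size
    return [
        (group_id, local_id, group_id * local_size + local_id)
        for group_id in range(num_groups)
        for local_id in range(local_size)
    ]


# Eight work items split into two work groups of four adjacent items.
items = decompose(global_size=8, local_size=4)
for group_id, local_id, global_id in items:
    print(group_id, local_id, global_id)
```

Work items 0 to 3 land in group 0 and items 4 to 7 in group 1, so each group covers adjacent points of the data set and can share them through local memory.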
Concurrent applications running on heterogeneous<br />
multi-core platform<br />
Questions<br />
End<br />
Disclaimer<br />
Nothing in these materials is an offer to sell any of the components or devices referenced<br />
herein. Certain components for use in the U.S. are available only through licensed suppliers.<br />
Some components are not available for use in the U.S.<br />
3/3/09
Trademarks<br />
<strong>Qualcomm</strong> is a registered trademark of <strong>Qualcomm</strong> Incorporated in the United States and<br />
may be registered in other countries. UPLINQ is a trademark of <strong>Qualcomm</strong> Incorporated.<br />
Other product and brand names may be trademarks or registered trademarks of their<br />
respective owners<br />