Parallel Computing On Qualcomm Platforms Using OpenCL - Uplinq

Parallel Computing on Qualcomm® Platforms Using OpenCL

Alex Bourd, Senior Staff Engineer & Manager, Qualcomm 

Qualcomm Proprietary 

OpenCL is a registered trademark of Apple Inc.

How can I harness all this power?

• Traditional parallel processing leverages similar, closely coupled processors

• Mobile devices have many built-in processors, each with unique capabilities

• Challenge: Capture and coordinate the power of disparate processors to solve a problem

• Harnessing this power opens up the opportunity for many applications not previously possible on mobile devices


OpenCL – The Missing Piece 

• “OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems.”

– Source: www.khronos.org

• OpenCL provides a single programming API, a single memory management model, and handles synchronization of data across different compute processors.

• OpenCL makes applications a reality 



Interaction with the real world 

Augmented Reality 

Object and Facial Recognition 

Gaming 

Fashion 

Landmarks 

Traffic scanning 

Navigation 

Face tracking 



Low Light Image Enhancement Effort 

[Chart: Speed-up vs Effort for the OpenCL implementation. Y-axis: % speed-up over C2D (0 to 100). X-axis: effort in man-weeks (1 to 6).]

• Wavelet-based image de-noising algorithm

• Speed-up shown is for a GTX260 GPU vs. a Core2Duo CPU

• Optimizations include global memory coalescing, tile processing, loop unrolling, and NDRange optimizations

Core2Duo is a registered trademark of Intel Inc. 



3D Processing: Depth Queue Enhancement 



Heterogeneous Compute Devices on Modern Mobile Platforms

• A modern mobile processor contains multiple general purpose compute devices

– Multiple CPUs with vector units

  • 2 Qualcomm Enhanced ARM CPUs @ 1.5 GHz = 24 GFLOPS

  • Latency sensitive, control oriented

– DSP

  • QDSP6 Multimedia DSP: 2.4 GFLOPS

  • Fixed point signal processing

– GPGPU

  • Adreno @ 350 MHz = 89.6 GFLOPS

  • Latency tolerant, stream oriented

• These compute devices must work well together to realize the computing potential of the system
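The slide does not say how the GFLOPS figures were derived. One reading that reproduces them is the usual peak-throughput product of units, clock rate, and floating-point operations per unit per cycle. The sketch below is only that reading: the per-cycle factors (8 for each CPU, 2 for each GPU ALU) and the 128-ALU count are assumptions chosen because they match the quoted numbers and the ALU range given on the later platform slide; they are not stated in the deck.

```python
def peak_gflops(units: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = units x clock (GHz) x FLOPs per unit per cycle."""
    return units * clock_ghz * flops_per_cycle


# ASSUMPTION: 8 FLOPs per cycle per CPU (for example a 4-wide vector
# multiply-add). Inferred from the slide's totals, not stated in the deck.
cpu = peak_gflops(units=2, clock_ghz=1.5, flops_per_cycle=8)

# ASSUMPTION: 128 ALUs, each doing one multiply-add (2 FLOPs) per cycle.
gpu = peak_gflops(units=128, clock_ghz=0.35, flops_per_cycle=2)

print(f"CPU peak: {cpu:.1f} GFLOPS")
print(f"GPU peak: {gpu:.1f} GFLOPS")
```

These are theoretical peaks; sustained throughput on real workloads depends on memory bandwidth and how well the code maps onto each device.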

ARM is a registered trademark of ARM Inc. 



General Purpose vs. Custom Devices 

• An Embedded OpenCL Platform has one or more OpenCL devices 

– Multicore CPU, GPGPU, DSP 

• These devices support programming via the OpenCL C API 

However: 

• Power is the absolute limiter in mobile:

– Joule’s law dominates Moore’s law – performance is limited by power usage

• Complex data types such as H.264 are better suited to dedicated HW

• Qualcomm also incorporates additional Embedded Custom Devices for improved power and efficiency

– Image effects processor 

– Image CODECs 

– Video CODECs 

– Audio CODECs 

– GPS device 



Integrating Programmable and Custom Devices 

• Custom Device compute resources do not support OpenCL C 

– They might not be programmable 

– Their Instruction Set Architecture is insufficient for implementing general purpose APIs 

• Custom Devices need to become a part of the OpenCL runtime

– So that the application can use OpenCL scheduling mechanisms to distribute its workload

– So that OpenCL memory mechanisms can be used to share data

• Qualcomm has proposed a Custom Device Extension for Khronos OpenCL to allow:

– Standardized device-to-device synchronization with minimal CPU involvement 

• Including Custom and OpenCL C devices 

– Specialized HW devices to execute custom kernels using the host OpenCL kernel API 

• A kernel may accept 0 or more config parameters to define run-time behavior of the device 

– To enable proprietary firmware kernels 

• Proprietary kernels may be implemented for OpenCL 



Qualcomm OpenCL Plans 

• Qualcomm plans to support OpenCL for our customers to both:

– Take advantage of high performance processing units on our chipsets

– Leverage the performance and power efficiency of key fixed function processors

• Our customers can leverage Qualcomm parallel compute technology to:

– Provide applications, not possible before, for a large volume of handheld mobile devices

– Bring into play previously dedicated or difficult to access compute devices

  • Qualcomm Enhanced NEON, QDSP, Adreno Graphics compute processors

– Efficiently share data processing between compute devices

– Implement value added video and imaging functionality

  • CODECs, Effects, Image recognition

– Take advantage of mobile power efficiency for complex applications

NEON is a trademark of ARM Inc. 



Current Qualcomm Platform 

• Current solution supports individual processing units sharing data through format conversions and memory copies

• CPU does synchronization and control 

[Diagram: software interfaces (C++, Java, etc.; OpenGL ES 2.0; OpenMAX; DirectShow®; QCT only) sit above the MSM processing units (multi-core CPU, GPGPU, QDSP, video CODEC, image processor (VFE)), each with its own memory (CPU, GPU, DSP, video CODEC, image).]

• Data exchanged through format conversions (stride, pixel depth, color formats) and memory copies 

Java is a trademark of Sun Microsystems Inc. OpenGL is a registered trademark of Silicon Graphics Inc. OpenMAX is a registered trademark of the Khronos Group Inc. DirectShow is a registered trademark of the Microsoft Corporation 



Qualcomm’s OpenCL Platform 

Multi-core CPU Processor: 1-4 vector processors, 1-2 GHz, 8-64 GFLOPS

GPGPU Graphics Processor (Adreno® Graphics): 32-128 ALUs (shader processors), 200-400 MHz, 13-102 GFLOPS

2D/Vector GPU: vector rendering, compositing

Video CODEC Processor: HD video encoding/decoding

QDSP Processor: signal processing, 0.6-2.4 GFLOPS

Image Processor (VFE): pixel format conversion, image compression, image enhancement, scale & rotate

OpenCL Application: computation, synchronization, process control, and data sharing for a heterogeneous processing environment

OpenCL programmable API plus:

• Fixed function subsystems for performance and power efficiency

• Consistent data format and data memory layout between fixed and parallel compute subsystems

• Consistent synchronization methods without CPU intervention

• Consistent process control


[Diagram: GPU block diagram. Recoverable labels: SP1, SP2, SP3, SP4; Instruction / Constant L1; HW Multi-threaded Scheduler; Unified General Purpose Register File; Special Function; FPU; Shared Memory.]

Load Balancing 

Multiple device workload distribution

Video conference game example: a web-based game application with video conferencing, load balanced by OpenCL across the devices:

• GPGPU Graphics Processor (Adreno Graphics): Low Light Enhancement, Game Rendering

• GPGPU OpenVG Vector Graphics Processor: Font Rendering, Surface Compositing

• Multi-core CPU Processors: Game Physics, Web Page Layout, Game Scene Graph Parsing

• VFE Image Processor: Bayer Filtering, Video Format Conversion

• Video CODEC Processor: Incoming Video Decode

• QDSP Signal Processor: Background Noise Filter

OpenVG is a trademark of the Khronos Group Inc.
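The workload distribution in the video conference game example can be sketched as a static task-to-device table that is grouped into one work queue per device. This is only an illustration of the idea on the slide, not how an OpenCL runtime is implemented: the short device names and the `build_queues` helper are hypothetical, and a real application would submit work through one OpenCL command queue per device rather than Python lists.

```python
from collections import defaultdict

# Task -> device pairs taken from the slide's example.
# The short device names are illustrative labels, not API identifiers.
ASSIGNMENTS = [
    ("Low Light Enhancement", "Adreno GPGPU"),
    ("Game Rendering", "Adreno GPGPU"),
    ("Font Rendering", "OpenVG Vector GPU"),
    ("Surface Compositing", "OpenVG Vector GPU"),
    ("Game Physics", "Multi-core CPU"),
    ("Web Page Layout", "Multi-core CPU"),
    ("Game Scene Graph Parsing", "Multi-core CPU"),
    ("Bayer Filtering", "VFE Image Processor"),
    ("Video Format Conversion", "VFE Image Processor"),
    ("Incoming Video Decode", "Video CODEC"),
    ("Background Noise Filter", "QDSP"),
]


def build_queues(assignments):
    """Group tasks into one ordered work queue per device."""
    queues = defaultdict(list)
    for task, device in assignments:
        queues[device].append(task)
    return dict(queues)


queues = build_queues(ASSIGNMENTS)
for device, tasks in queues.items():
    print(f"{device}: {', '.join(tasks)}")
```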


View of Next Generation OpenCL 

[Diagram: compute units (CU: 3D (SIMD), CU: QDSP/ARM, HW: Video) and the host (Qualcomm Enhanced ARM CPU) connected by a communication fabric. Each has semaphores; a queue carries Task1 to Task3 together with Send1/Send2 and Wait1/Wait2 operations.]


OpenCL System Overview 

How it works 



CPUs: multiple cores driving performance increases (multi-processor programming, e.g. OpenMP)

GPGPUs: increasingly general purpose data-parallel computing, improving numerical precision (graphics APIs and shading languages)

Emerging intersection: OpenCL, heterogeneous computing

OpenMP is a trademark of the OpenMP Architecture Review Board

OpenCL – Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors

Source: www.khronos.org


OpenCL Still Requires Smart Developers to Leverage Compute Devices Effectively

CPUs are

• Good at complex logic/branching

• Have several cores

• Good at very long programs

• Don’t have image filtering hardware

• Don’t have local memory

• Have hierarchical memory cache

• Good for random data

GPUs are

• Bad at complex logic/branching

• Have hundreds of cores

• Bad at long programs

• Have efficient image filtering hardware

• Have local on-chip memory

• Have simple or no memory cache

• Good for streaming data

• Many practical algorithms require some of both – simple highly parallel steps and steps that require very long programs with complex logic and branching

• It would be natural to implement such algorithms on a combination of CPU and GPGPU, each doing the parts at which it is really good


However OpenCL Provides for Consistent APIs 

• Need to unify the concept of a “computing circuit”

– CPU and GPU cores are apples and oranges, so some common concept is needed

– OpenCL came up with the common concept of an OpenCL Device, though it still kept three distinct device classes based on market reality:

  • CL_DEVICE_TYPE_CPU (Qualcomm Enhanced ARM CPU, including Enhanced NEON)

  • CL_DEVICE_TYPE_ACCELERATOR (QDSP)

  • CL_DEVICE_TYPE_GPU (Adreno Graphics)

• Emerging “Fixed Functionality Device”

– such as video decoding hardware, camera demosaic hardware

• Need to unify memory classes 

– Global memory 

– On-chip memory caches 

– On-chip shared memory 

– Register files 
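The three device classes on this slide correspond to values of the `cl_device_type` bitfield in the OpenCL headers (`CL_DEVICE_TYPE_CPU`, `CL_DEVICE_TYPE_GPU`, `CL_DEVICE_TYPE_ACCELERATOR`). Because the type is a bitfield, an application can ask for several classes at once. The sketch below mirrors those header values in Python to show the filtering idea; the device inventory and the `select` helper are hypothetical stand-ins for what `clGetDeviceIDs` would return on a real platform. OpenCL 1.2 later added a `CL_DEVICE_TYPE_CUSTOM` class, which is not modeled here.

```python
from enum import IntFlag


class DeviceType(IntFlag):
    """Bit values mirror the cl_device_type constants in the OpenCL headers."""
    DEFAULT = 1 << 0
    CPU = 1 << 1
    GPU = 1 << 2
    ACCELERATOR = 1 << 3


# Hypothetical inventory using the device names from the slide.
DEVICES = {
    "Qualcomm Enhanced ARM CPU": DeviceType.CPU,
    "QDSP": DeviceType.ACCELERATOR,
    "Adreno Graphics": DeviceType.GPU,
}


def select(mask: DeviceType) -> list:
    """Return the names of devices whose type matches any bit in the mask."""
    return [name for name, kind in DEVICES.items() if kind & mask]


# Ask for every stream-oriented or signal-processing device in one query.
print(select(DeviceType.GPU | DeviceType.ACCELERATOR))
```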



Anatomy of an OpenCL Platform 

• A host computer can have one or more OpenCL devices 

• Each OpenCL device has one or more compute units 

• Each compute unit has one or more processing elements 
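The hierarchy above (host, devices, compute units, processing elements) can be sketched as nested data types. This is a minimal model for illustration only: the class names are hypothetical, and the unit and element counts are made-up example values, not figures for any Qualcomm part.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeUnit:
    processing_elements: int  # one or more per compute unit


@dataclass
class Device:
    name: str
    compute_units: List[ComputeUnit] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        return sum(cu.processing_elements for cu in self.compute_units)


@dataclass
class Host:
    devices: List[Device]  # one or more OpenCL devices per host


# Illustrative counts only (assumptions, not taken from the slides).
host = Host(devices=[
    Device("cpu", [ComputeUnit(1), ComputeUnit(1)]),
    Device("gpu", [ComputeUnit(32) for _ in range(4)]),
])

for device in host.devices:
    print(device.name, len(device.compute_units),
          device.total_processing_elements())
```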



OpenCL Memory Model 

• Memory Types

– Private: on-chip, temp register file

– Local: on-chip, read-write, shared within a compute unit

– Constant: on-chip memory, read-only, shared within a compute unit

– Global: system memory, shared between all compute units, can be on host or device

• Memory model

– OpenCL provides a barrier mechanism for synchronization

  o Memory is assumed to be undefined across work items unless explicitly synchronized

– Multiple distinct address spaces

– Address spaces can be collapsed depending on the device’s memory subsystem

• Diagram shows typical memory layout for a GPU architecture

[Diagram: a compute device contains compute units 1 to N. Each compute unit has ALUs #1 to #M, each with its own private memory, plus a local memory shared within the unit. All compute units share a global / constant memory data cache, backed by global memory in compute device memory.]
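The rule that memory is undefined across work items unless explicitly synchronized can be illustrated with a host-side analogy. In the sketch below, Python threads stand in for the work items of one work group, a shared list stands in for local memory, and `threading.Barrier` plays the role that `barrier(CLK_LOCAL_MEM_FENCE)` plays inside an OpenCL C kernel. It is an analogy under those assumptions, not OpenCL code; the group size and values are arbitrary.

```python
import threading

WORK_GROUP_SIZE = 4
local_mem = [0] * WORK_GROUP_SIZE   # stands in for per-work-group local memory
result = [0] * WORK_GROUP_SIZE      # stands in for a global memory buffer
barrier = threading.Barrier(WORK_GROUP_SIZE)


def work_item(local_id: int) -> None:
    # Phase 1: each work item writes only its own slot of local memory.
    local_mem[local_id] = (local_id + 1) * 10

    # Without this barrier, a neighbour's slot might not be written yet.
    barrier.wait()

    # Phase 2: now it is safe to read a slot written by another work item.
    neighbour = (local_id + 1) % WORK_GROUP_SIZE
    result[local_id] = local_mem[local_id] + local_mem[neighbour]


threads = [threading.Thread(target=work_item, args=(i,))
           for i in range(WORK_GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)  # [30, 50, 70, 50]
```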


Memory specifics for various devices 

• A CPU generally doesn’t have physical local memory; local memory can be simulated using global memory

• Fixed functionality devices (such as a video decoder) might only have global memory access

• A GPU might have dedicated global memory or have global memory shared with the CPU

• Global memory cache is optional; however, if present, the global cache layout is visible to applications (OpenCL query for size and line size)

• Minimum requirements for embedded systems:

– Local memory: 1 KB per compute unit

– Const on-chip memory: 4 KB per compute unit

– Global on-chip cache: Optional

– Global memory: 1 MB
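A portable application would typically compare what a device reports against the limits it needs before choosing a code path. The sketch below checks a device description against the minimums listed on this slide. The dictionary keys, the `missing_requirements` helper, and the two example devices are hypothetical; on a real platform the values would come from `clGetDeviceInfo` queries such as `CL_DEVICE_LOCAL_MEM_SIZE`, `CL_DEVICE_GLOBAL_MEM_SIZE`, `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE`, and `CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE`.

```python
KB = 1024
MB = 1024 * KB

# Minimums as listed on the slide (the global cache is optional, so omitted).
EMBEDDED_MINIMUMS = {
    "local_mem_bytes": 1 * KB,
    "constant_mem_bytes": 4 * KB,
    "global_mem_bytes": 1 * MB,
}


def missing_requirements(device: dict) -> list:
    """Return the keys for which the device falls below the slide's minimums."""
    return [key for key, minimum in EMBEDDED_MINIMUMS.items()
            if device.get(key, 0) < minimum]


# Hypothetical device descriptions, for illustration only.
gpu = {"local_mem_bytes": 8 * KB, "constant_mem_bytes": 4 * KB,
       "global_mem_bytes": 256 * MB}
fixed_function = {"local_mem_bytes": 0, "constant_mem_bytes": 4 * KB,
                  "global_mem_bytes": 64 * MB}

print(missing_requirements(gpu))
print(missing_requirements(fixed_function))
```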



OpenCL data set decomposition into work groups & work items

A WORK ITEM is a point in a data set

WORK GROUPS are made of adjacent WORK ITEMS; WORK ITEMS in a WORK GROUP share data through local memory

One WORK GROUP at a time per compute unit
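The decomposition can be sketched for a one-dimensional data set: each work item has a global ID, and adjacent IDs are grouped into work groups of a fixed local size, so that global ID = group ID × local size + local ID. The function name and the sizes below are illustrative assumptions; in an OpenCL C kernel the same indices are available through `get_global_id`, `get_group_id`, and `get_local_id`.

```python
def decompose(global_size: int, local_size: int) -> list:
    """Split a 1-D index space into work groups of adjacent work items."""
    if global_size % local_size != 0:
        # Mirrors the OpenCL 1.x rule that the global size must be a
        # multiple of the work-group size.
        raise ValueError("global_size must be a multiple of local_size")
    groups = []
    for group_id in range(global_size // local_size):
        groups.append([group_id * local_size + local_id
                       for local_id in range(local_size)])
    return groups


groups = decompose(global_size=16, local_size=4)
for group_id, items in enumerate(groups):
    print(f"work group {group_id}: work items {items}")
```

With 16 work items and a local size of 4, this yields four work groups; a device with two compute units would then process them two at a time, one work group per compute unit.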



Concurrent applications running on heterogeneous multi-core platform


Questions 

End 



Disclaimer 

Nothing in these materials is an offer to sell any of the components or devices referenced herein. Certain components for use in the U.S. are available only through licensed suppliers. Some components are not available for use in the U.S.


3/3/09

Trademarks 

Qualcomm is a registered trademark of Qualcomm Incorporated in the United States and may be registered in other countries. UPLINQ is a trademark of Qualcomm Incorporated. Other product and brand names may be trademarks or registered trademarks of their respective owners.
