30.01.2015 Views

Parallel Computing On Qualcomm Platforms Using OpenCL - Uplinq


<strong>Parallel</strong> <strong>Computing</strong> on <strong>Qualcomm</strong>® <strong>Platforms</strong> <strong>Using</strong> <strong>OpenCL</strong><br />

Alex Bourd, Senior Staff Engineer & Manager, <strong>Qualcomm</strong><br />

<strong>Qualcomm</strong> Proprietary<br />

<strong>OpenCL</strong> is a registered trademark of Apple Inc.


How can I harness all this power?<br />

• Traditional parallel processing<br />

leverages similar, closely coupled<br />

processors<br />

• Mobile devices have many built-in<br />

processors, each with unique<br />

capabilities<br />

• Challenge: Capture and coordinate<br />

the power of disparate processors<br />

to solve a problem.<br />

• Harnessing this power opens up the<br />

opportunity for many applications,<br />

not previously possible on mobile<br />

devices<br />



<strong>OpenCL</strong> – The Missing Piece<br />

• “<strong>OpenCL</strong> (Open <strong>Computing</strong> Language)<br />

is the first open, royalty-free standard<br />

for general-purpose parallel<br />

programming of heterogeneous<br />

systems.”<br />

– Source: www.khronos.org<br />

• <strong>OpenCL</strong> provides a single<br />

programming API, single memory<br />

management model, and handles<br />

synchronization of data for different<br />

compute processors.<br />

• <strong>OpenCL</strong> makes applications a reality<br />



Interaction with the real world<br />

Augmented Reality<br />

Object and Facial Recognition<br />

Gaming<br />

Fashion<br />

Landmarks<br />

Traffic scanning<br />

Navigation<br />

Sue<br />

Face tracking<br />



Low Light Image Enhancement Effort<br />

[Chart: Speed-up vs Effort. Series: <strong>OpenCL</strong>. Vertical axis: % speed-up over C2D (0 to 100). Horizontal axis: effort in man-weeks (1 to 6).]<br />

• Wavelet-based image de-noising algorithm<br />

• Speed-up shown is for a GTX260 GPU vs. a Core2Duo CPU<br />

• Optimizations include global memory coalescing, tile processing, loop unrolling, and NDRange optimizations<br />

Core2Duo is a registered trademark of Intel Inc.<br />



3D Processing: Depth Cue Enhancement<br />



Heterogeneous Compute Devices on Modern<br />

Mobile <strong>Platforms</strong><br />

• A modern mobile processor contains multiple general purpose compute<br />

devices<br />

– Multiple CPUs with vector units<br />

• 2 <strong>Qualcomm</strong> Enhanced ARM CPUs<br />

@ 1.5GHz = 24 GFLOPS<br />

• Latency sensitive, control oriented<br />

– DSP<br />

• QDSP6 Multimedia DSP 2.4 GFLOPS<br />

• Fixed point signal Processing<br />

– GPGPU<br />

• Adreno @ 350MHz = 89.6 GFLOPS<br />

• Latency tolerant, stream oriented<br />

• These compute devices must work well together to realize the computing<br />

potential of the system<br />
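The peak figures above are consistent with the usual back-of-envelope formula: units times clock rate times floating-point operations per unit per cycle. The sketch below reproduces the CPU and Adreno numbers. The per-cycle factors (8 FLOPs per CPU core, and 128 ALUs each retiring one 2-FLOP multiply-add) are assumptions inferred from the quoted totals, not figures stated on the slide.

```python
def peak_gflops(units, clock_mhz, flops_per_unit_per_cycle):
    """Peak throughput in GFLOPS: units x clock x FLOPs per unit per cycle."""
    return units * clock_mhz * flops_per_unit_per_cycle / 1000


# Assumption: each CPU core's vector unit retires 8 FLOPs per cycle.
cpu = peak_gflops(units=2, clock_mhz=1500, flops_per_unit_per_cycle=8)

# Assumption: 128 ALUs, each retiring one multiply-add (2 FLOPs) per cycle.
gpu = peak_gflops(units=128, clock_mhz=350, flops_per_unit_per_cycle=2)

print(cpu, gpu)
```

These are theoretical ceilings; sustained throughput depends on memory bandwidth and how well the workload maps to each device.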

ARM is a registered trademark of ARM Inc.<br />



General Purpose vs. Custom Devices<br />

• An Embedded <strong>OpenCL</strong> Platform has one or more <strong>OpenCL</strong> devices<br />

– Multicore CPU, GPGPU, DSP<br />

• These devices support programming via the <strong>OpenCL</strong> C API<br />

However:<br />

• Power is the absolute limiter in mobile:<br />

– Joule’s law dominates Moore’s law – performance is limited by power usage<br />

• Complex data types such as H.264 are better suited to dedicated HW<br />

• <strong>Qualcomm</strong> also incorporates additional Embedded Custom Devices for improved<br />

power and efficiency<br />

– Image effects processor<br />

– Image CODECs<br />

– Video CODECs<br />

– Audio CODECs<br />

– GPS device<br />



Integrating Programmable and Custom Devices<br />

• Custom Device compute resources do not support <strong>OpenCL</strong> C<br />

– They might not be programmable<br />

– Their Instruction Set Architecture is insufficient for implementing general purpose APIs<br />

• Custom Devices need to become a part of <strong>OpenCL</strong> runtime<br />

– So that the application can use <strong>OpenCL</strong> scheduling mechanisms to distribute its<br />

workload<br />

– <strong>OpenCL</strong> memory mechanisms can be used to share data.<br />

• <strong>Qualcomm</strong> has proposed a Custom Device Extension for Khronos <strong>OpenCL</strong> to<br />

allow:<br />

– Standardized device-to-device synchronization with minimal CPU involvement<br />

• Including Custom and <strong>OpenCL</strong> C devices<br />

– Specialized HW devices to execute custom kernels using the host <strong>OpenCL</strong> kernel API<br />

• A kernel may accept 0 or more config parameters to define run-time behavior of the device<br />

– To enable proprietary firmware kernels<br />

• Proprietary kernels may be implemented for <strong>OpenCL</strong><br />



<strong>Qualcomm</strong> <strong>OpenCL</strong> Plans<br />

• <strong>Qualcomm</strong> plans to support <strong>OpenCL</strong> for our customers to both:<br />

– Take advantage of high performance processing units on our chipsets<br />

– Leverage the performance and power efficiency of key fixed function<br />

processors<br />

• Our customers can leverage <strong>Qualcomm</strong> parallel compute technology to:<br />

– Provide applications, not possible before, for a large volume of handheld<br />

mobile devices<br />

– Bring into play previously dedicated or difficult to access compute devices<br />

• <strong>Qualcomm</strong> Enhanced NEON, QDSP, Adreno Graphics compute processors<br />

– Efficiently share data processing between compute devices<br />

– Implement value added video and imaging functionality<br />

• CODECs, Effects, Image recognition<br />

– Take advantage of mobile power efficiency for complex applications<br />

NEON is a trademark of ARM Inc.<br />



Current <strong>Qualcomm</strong> Platform<br />

• Current solution supports individual processing units sharing data through format<br />

conversions and memory copies<br />

• CPU does synchronization and control<br />

[Diagram: the MSM contains a multi-core CPU processor, a GPGPU processor, a QDSP processor, a video CODEC processor, and an image processor (VFE), each with its own memory (CPU, GPU, DSP, video CODEC, and image memory). Applications reach them through separate interfaces: C++, Java, etc.; OpenGL ES 2.0; OpenMAX; DirectShow®; and QCT-only interfaces.]<br />

• Data exchanged through format conversions (stride, pixel depth, color formats) and memory copies<br />
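To make the cost of such a conversion concrete, here is a minimal sketch of a stride conversion: copying an image row by row from a buffer with padded rows into one with a different row stride. The function name and the one-byte-per-pixel layout are assumptions chosen for illustration; they do not describe any particular Qualcomm buffer format.

```python
def repack_rows(src, width, height, src_stride, dst_stride):
    """Copy a width x height image of 1-byte pixels between two row strides."""
    if src_stride < width or dst_stride < width:
        raise ValueError("stride must be at least the row width")
    dst = bytearray(dst_stride * height)  # destination padding stays zero
    for row in range(height):
        s = row * src_stride  # start of this row in the source buffer
        d = row * dst_stride  # start of this row in the destination buffer
        dst[d:d + width] = src[s:s + width]
    return bytes(dst)
```

Every pixel is read and written once per hand-off, which is the overhead a consistent data layout between subsystems is meant to remove.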

Java is a trademark of Sun Microsystems Inc. OpenGL is a registered trademark of Silicon Graphics Inc. OpenMAX is a registered trademark of the Khronos Group Inc. DirectShow is a registered trademark of the Microsoft Corporation<br />



<strong>Qualcomm</strong>’s <strong>OpenCL</strong> Platform<br />

<strong>OpenCL</strong> Application: computation, synchronization, process control, and data sharing for a heterogeneous processing environment<br />

• Multi-core CPU Processor: 1-4 vector processors, 1-2 GHz, 8-64 GFLOPS<br />

• GPGPU Graphics Processor (Adreno® Graphics): 32-128 ALUs (shader processors), 200-400 MHz, 13-102 GFLOPS<br />

• 2D/Vector GPU: vector rendering, compositing<br />

• Video CODEC Processor: HD video encoding/decoding<br />

• QDSP Processor: signal processing, 0.6-2.4 GFLOPS<br />

• Image Processor (VFE): pixel format conversion, image compression, image enhancement, scale & rotate<br />

<strong>OpenCL</strong> programmable API plus:<br />

• Fixed function subsystems for performance and power efficiency<br />

• Consistent data format and data memory layout between fixed and parallel compute subsystems<br />

• Consistent synchronization methods without CPU intervention<br />

• Consistent process control<br />



[Diagram: GPGPU graphics processor with four shader processors, SP1 to SP4. Each SP contains an Instruction / Constant L1, a HW multi-threaded scheduler, a unified general purpose register file, shared memory, a special function block, and banks of 32-bit and 16-bit FPUs.]<br />

Load Balancing<br />

Multiple device workload distribution<br />

Web based game application with video conferencing example, distributed by <strong>OpenCL</strong> Load Balancing:<br />

• GPGPU Graphics Processor (Adreno Graphics): low light enhancement, game rendering<br />

• GPGPU OpenVG Vector Graphics Processor: font rendering, surface compositing<br />

• Multi-core CPU Processors: game physics, web page layout, game scene graph parsing<br />

• VFE Image Processor: Bayer filtering, video format conversion<br />

• Video CODEC Processor: incoming video decode<br />

• QDSP Signal Processor: background noise filter<br />

OpenVG is a trademark of the Khronos Group Inc.<br />
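The workload distribution in this example can be sketched as a static affinity table that routes each task to a per-device queue. The task-to-device pairs are read from the slide's diagram. The function name and the CPU fallback for unlisted tasks are assumptions added for illustration, and a real load balancer would also weigh current device load.

```python
from collections import defaultdict

# Task-to-device affinities as read from the slide's example diagram.
AFFINITY = {
    "low light enhancement": "GPGPU graphics processor",
    "game rendering": "GPGPU graphics processor",
    "font rendering": "OpenVG vector graphics processor",
    "surface compositing": "OpenVG vector graphics processor",
    "game physics": "multi-core CPU",
    "web page layout": "multi-core CPU",
    "game scene graph parsing": "multi-core CPU",
    "bayer filtering": "VFE image processor",
    "video format conversion": "VFE image processor",
    "incoming video decode": "video CODEC processor",
    "background noise filter": "QDSP signal processor",
}


def build_queues(tasks, fallback="multi-core CPU"):
    """Group tasks into one queue per device, preserving submission order.

    Tasks without a known affinity go to the fallback device (an assumption).
    """
    queues = defaultdict(list)
    for task in tasks:
        queues[AFFINITY.get(task, fallback)].append(task)
    return dict(queues)
```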



View of Next Generation <strong>OpenCL</strong><br />

[Diagram: a host (<strong>Qualcomm</strong> Enhanced ARM CPU) and the compute units CU – 3D (SIMD), CU – QDSP/ARM, and HW – Video are linked by a communication fabric. The units carry semaphores, and a queue holds Task1 to Task3 together with Send1, Send2, Wait1, and Wait2 entries.]<br />



<strong>OpenCL</strong> System Overview<br />

How it works<br />



• CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP<br />

• GPGPUs: increasingly general purpose data-parallel computing; improving numerical precision; graphics APIs and shading languages<br />

• Emerging intersection: <strong>OpenCL</strong>, Heterogeneous <strong>Computing</strong><br />

OpenMP is a trademark of the OpenMP Architecture Review Board<br />

<strong>OpenCL</strong> – Open <strong>Computing</strong> Language<br />

Open, royalty-free standard for portable, parallel programming of heterogeneous<br />

parallel computing CPUs, GPUs, and other processors<br />

Source: www.khronos.org<br />



<strong>OpenCL</strong> Still Requires Smart Developers to<br />

Leverage Compute Devices Effectively<br />

CPUs are<br />

• Good at complex logic/branching<br />

• Have several cores<br />

• Good at very long programs<br />

• Don’t have image filtering<br />

hardware<br />

• Don’t have local memory<br />

• Have hierarchical memory cache<br />

• Good for random data<br />

GPUs are<br />

• Bad at complex logic/branching<br />

• Have hundreds of cores<br />

• Bad at long programs<br />

• Have efficient image filtering<br />

hardware<br />

• Have local on-chip memory<br />

• Have simple or no memory cache<br />

• Good for streaming data<br />

• Many practical algorithms require some of both – simple highly parallel steps and<br />

steps that require very long programs with complex logic and branching<br />

• It would be natural to implement such algorithms on a combination of CPU and<br />

GPGPU, each doing parts at which it is really good<br />



However <strong>OpenCL</strong> Provides for Consistent APIs<br />

• Need to unify the concept of a “computing circuit”<br />

– CPU and GPU cores are apples and oranges, need some common concept<br />

– Came up with a common concept of an <strong>OpenCL</strong> Device, though still kept<br />

three distinct device classes based on market reality:<br />

• CL_CPU (<strong>Qualcomm</strong> Enhanced ARM CPU, including Enhanced NEON)<br />

• CL_ACCELERATOR (QDSP)<br />

• CL_GPU (Adreno Graphics)<br />

• Emerging “Fixed Functionality Device”<br />

– such as video decoding hardware, camera demosaic hardware<br />

• Need to unify memory classes<br />

– Global memory<br />

– <strong>On</strong>-chip memory caches<br />

– <strong>On</strong>-chip shared memory<br />

– Register files<br />
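The three device classes listed above correspond to bits in the OpenCL `cl_device_type` bitfield, where the published constants are named `CL_DEVICE_TYPE_CPU`, `CL_DEVICE_TYPE_GPU`, and `CL_DEVICE_TYPE_ACCELERATOR`. The sketch below mirrors that bitfield in plain Python to show how an application might filter devices by class. The device inventory and the `select_devices` helper are hypothetical and only reuse the example names from the slide.

```python
from enum import IntFlag


class DeviceType(IntFlag):
    """Mirror of the cl_device_type bitfield from the OpenCL headers."""
    DEFAULT = 1 << 0      # CL_DEVICE_TYPE_DEFAULT
    CPU = 1 << 1          # CL_DEVICE_TYPE_CPU
    GPU = 1 << 2          # CL_DEVICE_TYPE_GPU
    ACCELERATOR = 1 << 3  # CL_DEVICE_TYPE_ACCELERATOR


# Hypothetical inventory using the example devices named on the slide.
DEVICES = {
    "Qualcomm Enhanced ARM CPU": DeviceType.CPU,
    "QDSP": DeviceType.ACCELERATOR,
    "Adreno Graphics": DeviceType.GPU,
}


def select_devices(mask):
    """Return the names of devices whose type matches any bit set in mask."""
    return [name for name, kind in DEVICES.items() if kind & mask]
```

Because the types are bit flags, a single query such as `select_devices(DeviceType.CPU | DeviceType.ACCELERATOR)` can ask for several classes at once.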



Anatomy of an <strong>OpenCL</strong> Platform<br />

• A host computer can have one or more <strong>OpenCL</strong> devices<br />

• Each <strong>OpenCL</strong> device has one or more compute units<br />

• Each compute unit has one or more processing elements<br />
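The containment hierarchy described above (host, devices, compute units, processing elements) can be modelled with a few small classes. This is only an illustrative sketch: the class names are made up, and the example counts (a GPGPU with 4 compute units of 32 processing elements, a CPU with 2 single-element compute units) are assumptions rather than figures taken from a specific chipset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeUnit:
    processing_elements: int  # one or more per compute unit


@dataclass
class Device:
    name: str
    compute_units: List[ComputeUnit] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        """Sum the processing elements across all compute units."""
        return sum(cu.processing_elements for cu in self.compute_units)


@dataclass
class Host:
    devices: List[Device] = field(default_factory=list)


# Assumed example configuration, for illustration only.
gpu = Device("GPGPU", [ComputeUnit(32) for _ in range(4)])
cpu = Device("CPU", [ComputeUnit(1) for _ in range(2)])
host = Host([cpu, gpu])
```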

<strong>OpenCL</strong> devices<br />



<strong>OpenCL</strong> Memory Model<br />

• Memory Types<br />

– Private: on-chip, temp register file<br />

– Local: on-chip, read-write, shared within a compute<br />

unit<br />

– Constant: on-chip memory, read-only, shared within<br />

compute unit<br />

– Global: system memory, shared between all<br />

compute units, can be on host or device<br />

• Memory model<br />

– <strong>OpenCL</strong> provides barrier mechanism for<br />

synchronization<br />

o Memory is assumed to be undefined across work items<br />

unless explicitly synchronized<br />

– Multiple distinct address spaces<br />

– Address spaces can be collapsed depending on the<br />

device’s memory subsystem<br />

• Diagram shows typical memory layout for a<br />

GPU architecture<br />

[Diagram: a compute device contains compute units 1 to N. Each compute unit has ALUs #1 to #M, each with its own private memory, and one local memory. The compute units share a global / constant memory data cache, which sits in front of global memory in compute device memory.]<br />
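The barrier rule from the memory model can be illustrated with ordinary threads standing in for the work items of one work group. In the sketch below each work item writes one value to a shared list that plays the role of local memory, waits at a barrier, and only then reads its neighbour's value. Without the barrier, the read could happen before the neighbour's write. Python's `threading.Barrier` is used here as a stand-in; in OpenCL C the corresponding call inside a kernel is `barrier(CLK_LOCAL_MEM_FENCE)`. The function name and the doubling step are arbitrary choices for the example.

```python
import threading


def run_work_group(values):
    """Simulate one work group: write, barrier, then read a neighbour."""
    n = len(values)
    local_mem = [None] * n          # stands in for on-chip local memory
    results = [None] * n
    barrier = threading.Barrier(n)  # one party per work item

    def work_item(local_id):
        # Phase 1: each work item publishes its own value.
        local_mem[local_id] = values[local_id] * 2
        # All work items must arrive before anyone reads.
        barrier.wait()
        # Phase 2: reading a neighbour's slot is now well defined.
        neighbour = (local_id + 1) % n
        results[local_id] = local_mem[local_id] + local_mem[neighbour]

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```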



Memory specifics for various devices<br />

• A CPU generally doesn’t have physical local memory; local memory can be simulated using global memory<br />

• Fixed functionality devices (such as a video decoder) might only have global memory access<br />

• GPU might have dedicated global memory or have global memory shared with<br />

CPU<br />

• Global memory cache is optional; however, if present, global cache layout is<br />

visible to applications (<strong>OpenCL</strong> query for size and line size)<br />

• Minimum requirements for embedded systems:<br />

– Local memory: 1 KB per compute unit<br />

– Const on-chip memory: 4 KB per compute unit<br />

– Global on-chip cache: Optional<br />

– Global memory: 1MB<br />
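A simple way to use these minimums is to compare a device's reported sizes against them before choosing a code path. The sketch below encodes the values exactly as listed on the slide. The dictionary keys and the helper name are hypothetical; in a real application the sizes would come from OpenCL device queries.

```python
KB = 1024
MB = 1024 * KB

# Minimums as listed on the slide, in bytes. Key names are hypothetical.
EMBEDDED_MINIMUMS = {
    "local_mem_per_cu": 1 * KB,
    "const_mem_per_cu": 4 * KB,
    "global_mem": 1 * MB,
}


def missing_requirements(device):
    """Return the sorted keys for which the device is below the minimum.

    A size that is absent from the device description counts as zero.
    """
    return sorted(
        key
        for key, minimum in EMBEDDED_MINIMUMS.items()
        if device.get(key, 0) < minimum
    )
```

The global on-chip cache is left out of the table because the slide marks it as optional.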



<strong>OpenCL</strong> data set decomposition into<br />

work groups & work items<br />

A WORK ITEM is a point in a data set<br />

WORK GROUPS are made of adjacent WORK ITEMS; the WORK ITEMS in a WORK GROUP share data through local memory<br />

<strong>On</strong>e WORK GROUP at a time per compute unit<br />
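The decomposition can be written down as index arithmetic: with a zero global offset, a work item's global id equals its group id times the work-group size plus its local id, in each dimension. The following sketch applies that to a 2D data set. The function names are made up for the example, and the sketch requires the global size to be a multiple of the work-group size.

```python
def decompose(global_size, local_size):
    """Split a 2D index space into work groups.

    Returns a dict mapping each (group_x, group_y) to the list of global
    (x, y) work-item ids it contains, in row-major order.
    """
    gx, gy = global_size
    lx, ly = local_size
    if gx % lx or gy % ly:
        raise ValueError("global size must be a multiple of local size")
    groups = {}
    for y in range(gy):
        for x in range(gx):
            group_id = (x // lx, y // ly)
            groups.setdefault(group_id, []).append((x, y))
    return groups


def global_id(group_id, local_id, local_size):
    """Recover a global id from group id and local id (zero global offset)."""
    return tuple(g * s + l for g, l, s in zip(group_id, local_id, local_size))
```

For a 4 x 4 data set with 2 x 2 work groups this yields four groups of four adjacent work items each.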



Concurrent applications running on heterogeneous<br />

multi-core platform<br />



Questions<br />

End<br />



Disclaimer<br />

Nothing in these materials is an offer to sell any of the components or devices referenced<br />

herein. Certain components for use in the U.S. are available only through licensed suppliers.<br />

Some components are not available for use in the U.S.<br />


3/3/09


Trademarks<br />

<strong>Qualcomm</strong> is a registered trademark of <strong>Qualcomm</strong> Incorporated in the United States and<br />

may be registered in other countries. UPLINQ is a trademark of <strong>Qualcomm</strong> Incorporated.<br />

Other product and brand names may be trademarks or registered trademarks of their<br />

respective owners<br />

