29.10.2014 Views

GPU Compute accelerated HEVC decoder on ARM® MaliTM-T600 ...


GPU Compute accelerated HEVC decoder on ARM® Mali™-T600 GPUs


Ittiam Systems Introduction

DSP Systems IP Company
Multimedia + Communication Systems
Multimedia Components, Systems, Hardware
Focus on Broadcast, Video Communication, Video Security, Mobile
IP Licensing Business Model
Founded in 2001
Venture funded
Flexible mix of one-time fees and royalties for licensing
300+ licensees
Worldwide
Fortune 100 companies, Tier 1 OEMs
Consistently rated as Most Preferred DSP IP Supplier
250-strong Engineering Team
World Class Talent
Deep Multimedia and end-application Expertise
29 patents issued, 30+ patents filed
World's most preferred DSP IP supplier: 2004 • 2005 • 2006 (DSP Professionals Survey by Forward Concepts)


Ittiam Multimedia Overview

Multimedia Components
Audio Codecs
Video Codecs / Image Codecs
Algorithms for Audio Effects, Acoustics, Imaging
ARM CPU, NEON Optimized
DSP + HW Accelerators + GPU expertise and capabilities

Middleware + SDKs
System components: Parsers, Creators, Stacks, Subtitles
Multimedia Integration: Android, Other Frameworks
Use Case validation
Enhancements to existing Middleware
Application Specific SDKs

OEM Applications
Complete Multimedia Applications
Covers major Multimedia Use Cases
Camera, Gallery, Editor, Players, Video Editor
Production tested
Customizable to requirements



Ittiam Multimedia Solutions and ARM

Strategic Platform
Focus on Mobile, Home, Portable segments
ARM Connected Community Member

Strong Portfolio of IP
Expertise in ARM architecture and optimizations for ARM

Long Investment
Many years of development on ARM Platforms
Covering ARM9E, ARM11, Cortex®-A8, A9, A15, A5, A7 and NEON™
In-house developed reference C models for all IP
Efficient, targeted for ARM, validated across multiple generations

Partnership
Joint Benchmarking of implementations
Early Access to Mali/OpenCL information
Early involvement on new platforms


Ittiam Media Processing Elements

Audio Codecs
Stereo and Multichannel
MP12, AAC-LC/HE v1 & v2, AC3, DD+
High Quality Resampler
Post Processing and Audio Effects
Field Proven

Acoustics
Voice Quality Enhancements with Echo Cancellation (AEC), Noise Reduction (ANR)
Equalizer for Microphone & Speaker
AGC, AVC, Audio De-Reverb
Mic Beam Forming

Video Codecs
MPEG2, MPEG-4, H.264, HEVC / H.265
Scalable across Multiple ARM Cores
Optimized for bandwidth and CPU + NEON
Error Resilience for Streaming Use cases
In Production

Image Processing
De-noise, Face detection, Red-eye correction
Panorama, HDR, Low Light, 3D
B&W, Sepia, Cross Process
Exposure, Colours, Geometric, Filters


HEVC Overview


HEVC / H.265 Standard

HEVC, also known as H.265, is a video compression standard jointly developed by ISO/IEC MPEG and ITU-T VCEG.
MPEG and VCEG established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard.
HEVC is the successor to the H.264 standard.
HEVC can support ultra-high resolutions up to 8192 x 4320 pixels.
HEVC offers a substantially higher video compression ratio compared to existing standards.


H.265 vs H.264

Tool | H.264 | H.265
Coding unit | 16x16 macroblocks; block coding structure | Coding tree blocks (64x64); quadtree coding structure
Transforms | 4x4 and 8x8 | 4x4, 8x8, 16x16 and 32x32
Inter Prediction | 4x4 to 16x16; symmetric partitions | 4x4 to 64x64; asymmetric partitions
Intra Prediction | 9 modes | 35 modes
Motion Prediction | Spatial median | Advanced Motion Vector Prediction (spatial + temporal)
Luma motion compensation | 6 taps for half-pel positions + bilinear filter for quarter-pel positions | 8 taps for half-pel positions + 7-tap filter for quarter-pel positions
Chroma motion compensation | 2 taps | 4 taps
Slices | Slices for parallel parsing | Wavefront parallel processing; tiles and slices for parallel parsing
In-loop filters | Deblocking | Deblocking and SAO
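The quadtree coding structure listed in the H.265 column can be pictured with a minimal sketch. The function name `split_ctb`, the `should_split` callback and the 8x8 minimum block size used here are assumptions for illustration only; in a real decoder the split decisions are read from the bitstream.

```python
def split_ctb(x, y, size, should_split, min_size=8):
    """Recursively split a square block into four quadrants.

    Returns a list of (x, y, size) leaf blocks. `should_split` is a
    hypothetical callback standing in for the split flags that a real
    decoder parses from the bitstream.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the quadrants in Z-scan order: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    split_ctb(x + dx, y + dy, half, should_split, min_size)
                )
        return leaves
    return [(x, y, size)]


def example_rule(x, y, size):
    # Hypothetical rule: split the 64x64 block once, then split only
    # its top-left 32x32 quadrant again.
    return size == 64 or (size == 32 and x == 0 and y == 0)


blocks = split_ctb(0, 0, 64, example_rule)
for block in blocks:
    print(block)
```

The leaves always tile the original 64x64 area exactly, so flat regions can stay as large blocks while detailed regions are split further.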


HEVC compression

About 50% bitrate saving over H.264 for video resolutions of 1080p and above; 30-40% saving over H.264 for lower resolutions.
35% reduction in bitrate for the same PSNR when compared to H.264.
Perceptual video quality is subjective and cannot be measured with PSNR values.
Subjective tests have shown around 50% reduction in bitrate for similar perceptual video quality when compared to H.264.

[Figure: bitrate versus year (1990, 2000, 2010) for MPEG-2, H.264/AVC and H.265/HEVC]


HEVC Applications – Near Term

The over-the-top (OTT) video services market is growing at a rapid pace, thanks to Netflix, Hulu, YouTube etc.
Smartphones and tablets contribute significantly to OTT growth, with consumers opting to view videos on the go.
OTT video services are also popular within TVs and set-top boxes.
Rapid growth in the OTT market chokes the network bandwidth.
One in five consumers abandons viewing due to slow feeds and a poor-quality viewing experience.
HEVC will enable a superior viewing experience with OTT video services.


HEVC Applications – Long Term

Higher quality video in traditional terrestrial and satellite broadcasts.
Video recording in cameras and mobile phones, for saving storage space or for higher quality.
Broadcasting 1080p video at 50 or 60 frames per second in the same bandwidth as 1080i (25 or 30 fps).
4K and 8K Ultra-HD broadcasts for theatre-like quality.


Need for Software HEVC Decoder

HEVC is a newly ratified standard and there is no hardware support for it in the current generation of processors (Embedded / Mobile / Applications SoCs).
Dedicated HW accelerators for HEVC increase the silicon area and hence the cost significantly.
The lack of HEVC content makes an early HW implementation risky.
Software decoding is a simpler and economically viable option for HEVC deployment NOW.
Handling the HEVC decoder complexity on a wide range of processors, with constraints on power consumption, is the key challenge for the software decoder.


Why use GPUs for Video Processing?

Decoding high-resolution video in software involves high computational complexity and will load the CPU enormously.
GPUs are highly compute-capable and power-efficient devices.
GPUs are generally idle during video playout.
GPU acceleration will free up the CPU to perform other (system) tasks.

[Figure: CPU core(s) (ARM Cortex with NEON) alongside a Mali-T600 / OpenCL-compliant GPU]


HEVC Decoding on Capable GPUs

GPUs are massively multithreaded devices capable of handling hundreds or thousands of threads in parallel at any given time.
Only the highly data-parallel algorithms of a video codec can be efficiently offloaded to the GPU for processing.

[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Inverse Quant, Intra Prediction, Inverse Transform, Recon, Deblocking & SAO. Legend: "Not suitable for GPU execution" and "Data parallel execution, suitable for GPU execution"]


Motion Compensation

The pixels of the current picture/frame are predicted from the pixels of a reference frame.
The reference picture can be from the past or the future.
The prediction happens on a block-by-block basis.
There can be multiple reference frames for each block.


Motion Compensation

The most compute-intensive part of motion compensation is sub-pixel interpolation:
• Luma – 8 or 7 tap filter
• Chroma – 4 tap filter

Sub-pixel interpolation is data parallel, i.e., the interpolation of each block within a frame can happen in parallel, and hence it is suited for GPU computing.
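The sub-pixel interpolation described above can be sketched for the half-sample luma case. This is a simplified illustration, not the Ittiam implementation: it assumes 8-bit samples, a single horizontal pass and edge padding by pixel repetition, and the function names are made up for this example. The tap values are the 8-tap half-sample luma coefficients of HEVC.

```python
# HEVC luma half-sample filter taps (8 taps, summing to 64).
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def clip_pixel(value, bit_depth=8):
    """Clamp a value to the valid sample range for the bit depth."""
    return max(0, min((1 << bit_depth) - 1, value))


def interpolate_half_pel_row(row):
    """Horizontal half-sample interpolation of one row of luma pixels.

    Output sample i lies halfway between row[i] and row[i + 1].
    Edge pixels are repeated as padding, an assumption made here to
    keep the sketch self-contained.
    """
    n = len(row)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, tap in enumerate(HALF_PEL_TAPS):
            # Taps cover positions i-3 .. i+4, clamped to the row.
            pos = min(max(i - 3 + k, 0), n - 1)
            acc += tap * row[pos]
        # Round and normalise by 64, the sum of the taps.
        out.append(clip_pixel((acc + 32) >> 6))
    return out


edge = [0, 0, 0, 0, 100, 100, 100, 100]
print(interpolate_half_pel_row(edge))
```

Each output sample depends only on a small window of input pixels, so every sample (or block of samples) could be computed by an independent GPU work-item, which is the data parallelism the slide refers to.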


Inverse Quantization and Transform

Inverse Quantization & Transform
• The residue values need to be inverse quantized
• 2-D inverse DCT transformations should be performed over the inverse-quantized data

Recon & In-Loop Filters
• Reconstruction: the output from motion compensation and intra prediction should be added to the output from the inverse transform
• In-loop filters such as deblocking and SAO are applied over the reconstructed samples

[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Inverse Quant, Intra Prediction, Inverse Transform, Recon, Deblocking & SAO]
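The reconstruction step above amounts to a per-sample addition followed by clipping to the valid sample range. A minimal sketch, assuming 8-bit samples and blocks stored as lists of rows; the function names are illustrative only.

```python
def clip_pixel(value, bit_depth=8):
    """Clamp a value to the valid sample range for the bit depth."""
    return max(0, min((1 << bit_depth) - 1, value))


def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the residual (inverse transform output) to the prediction
    (motion compensation or intra prediction output), sample by sample."""
    return [
        [clip_pixel(p + r, bit_depth) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


# Hypothetical 2x2 block values chosen to exercise both clipping limits.
prediction = [[100, 200], [250, 5]]
residual = [[-10, 30], [20, -9]]
recon = reconstruct_block(prediction, residual)
print(recon)
```

Every sample is computed independently of its neighbours, which is why this stage fits data-parallel execution.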


Challenges in CPU+GPU Implementation

Efficient partitioning of work between CPU and GPU
• The effective FPS of the decoder will be the minimum of the FPS achieved by the CPU and the GPU for their respective work
• So the partitioning needs to be efficient, so that both of them perform their respective work at almost the same speed (FPS)

Efficient pipelining of data between CPU and GPU
• The algorithms running on the CPU will depend on the output of algorithms running on the GPU and/or vice versa
• A good design should make sure that neither the CPU nor the GPU spends any time waiting for the output of the other

Cache coherency
• Cache coherency between CPU and GPU data needs to be ensured
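The partitioning and pipelining points above can be put into a small throughput model. This is a simplified sketch under stated assumptions: each device has a fixed per-frame rate for its share of the work, and synchronisation overhead is ignored. The function names and the 40/60 FPS figures are hypothetical, not measurements.

```python
def pipelined_fps(cpu_fps, gpu_fps):
    """CPU and GPU work on different frames at the same time, so the
    slower device sets the overall rate."""
    return min(cpu_fps, gpu_fps)


def serialized_fps(cpu_fps, gpu_fps):
    """Each device waits for the other on every frame, so the per-frame
    times add up."""
    return 1.0 / (1.0 / cpu_fps + 1.0 / gpu_fps)


# Hypothetical rates for each device's share of the decode work.
cpu_rate, gpu_rate = 40.0, 60.0
print(pipelined_fps(cpu_rate, gpu_rate))   # limited by the CPU share
print(serialized_fps(cpu_rate, gpu_rate))  # lower, because of waiting
```

In this model, moving work from the slower device to the faster one raises the pipelined rate until the two rates meet, while any waiting between the devices pulls the result towards the serialized figure.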


Benefits of Mali T600 GPU

128-bit vector processing
• Suits DSP algorithms like video processing.

Presence of GPU cache instead of local memory
• No requirement for data transfers from/to global memory. Can be understood just like a CPU.

Flexible OpenCL workgroup size
• Works optimally for a large range of OpenCL workgroup sizes. Multiple block sizes in a video frame can be handled efficiently.

No divergent threads
• Similar to CPU code, conditional code can be used in OpenCL kernels as well. Different kinds of filter types, filter lengths etc. in video decode can be handled efficiently.

Unified memory
• CPU and GPU share the same memory. Video YUV buffers are pretty big, and there is no need for costly memory transfers of those buffers.

Mali GPUs are well suited for video acceleration, with significant power/performance benefits.


Thank You

For more information visit www.ittiam.com or contact us at mkt@ittiam.com
