GPU Compute accelerated HEVC decoder on ARM® Mali™-T600 GPUs
Ittiam Systems Introduction
DSP Systems IP Company
Multimedia + Communication Systems
Multimedia Components, Systems, Hardware
Focus on Broadcast, Video Communication, Video Security, Mobile
IP Licensing Business Model
Founded in 2001
Venture funded
Flexible mix of one-time fees and royalties for licensing
300+ licensees
Worldwide
Fortune 100 companies, Tier 1 OEMs
Consistently rated as Most Preferred DSP IP Supplier
250-strong Engineering Team
World Class Talent
Deep Multimedia and end application Expertise
29 patents issued, 30+ patents filed
World’s most preferred DSP IP supplier: 2004 • 2005 • 2006
DSP Professionals Survey by Forward Concepts
Ittiam Multimedia Overview
Multimedia Components
Audio Codecs
Video Codecs/Image Codecs
Algorithms for Audio Effects, Acoustics, Imaging
ARM CPU, NEON Optimized
DSP + HW Accelerators + GPU expertise and capabilities
Middleware + SDKs
System components: Parsers, Creators, Stacks, Subtitles
Multimedia Integration: Android, Other Frameworks
Use Case validation
Enhancements to existing Middleware
Application Specific SDKs
OEM Applications
Complete Multimedia Applications
Covers major Multimedia Use Cases
Camera, Gallery, Editor, Players, Video Editor
Production tested
Customizable to requirements
Ittiam Multimedia Solutions and ARM
Strategic Platform
Focus on Mobile, Home, Portable segments
ARM Connected Community Member
Strong Portfolio of IP
Expertise in ARM architecture and optimizations for ARM
Long Investment
Many years of development on ARM Platforms
Covering ARM9E, ARM11, Cortex®-A8, A9, A15, A5, A7 and NEON™
In-house developed reference C models for all IP
Efficient, targeted for ARM, validated across multiple generations
Partnership
Joint Benchmarking of implementations
Early Access to Mali/OpenCL information
Early involvement on new platforms
Ittiam Media Processing Elements
Audio Codecs
Stereo and Multichannel
MP12, AAC-LC/HE v1&v2, AC3, DD+
High Quality Resampler
Post Processing and Audio Effects
Field Proven
Acoustics
Voice Quality Enhancements with Echo Cancellation (AEC), Noise Reduction (ANR)
Equalizer for Microphone & Speaker
AGC, AVC, Audio De-Reverb
Mic Beam Forming
Video Codecs
MPEG2, MPEG-4, H.264, HEVC / H.265
Scalable across Multiple ARM Cores
Optimized for bandwidth and CPU + NEON
Error Resilience for Streaming Use cases
In Production
Image Processing
De-noise, Face detection, Red-eye correction
Panorama, HDR, Low Light, 3D
B&W, Sepia, Cross Process
Exposure, Colours, Geometric, Filters
HEVC Overview
HEVC / H.265 Standard
HEVC, also known as H.265, is a video compression standard jointly developed by ISO/IEC MPEG and ITU-T VCEG
MPEG and VCEG established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard
HEVC is the successor to the H.264 standard
HEVC supports ultra-high resolutions up to 8192 x 4320 pixels
HEVC offers a substantially higher video compression ratio than existing standards
H.265 vs H.264
Tool | H.264 | H.265
Coding unit | 16x16 macroblocks | Coding tree blocks (64x64)
Block coding structure | - | Quadtree coding structure
Transforms | 4x4 and 8x8 | 4x4, 8x8, 16x16 and 32x32
Inter prediction | 4x4 to 16x16, symmetric partitions | 4x4 to 64x64, asymmetric partitions
Intra prediction | 9 modes | 35 modes
Motion prediction | Spatial median | Advanced Motion Vector Prediction (spatial + temporal)
Luma motion compensation | 6 taps for half-pel positions + bilinear filter for quarter-pel positions | 8 taps for half-pel positions + 7-tap filter for quarter-pel positions
Chroma motion compensation | 2 taps | 4 taps
Slices | Slices for parallel parsing | Wavefront parallel processing; tiles and slices for parallel parsing
In-loop filters | Deblocking | Deblocking and SAO
HEVC compression
[Figure: bitrate of MPEG-2, H.264/AVC and H.265/HEVC over time (1990, 2000, 2010)]
About 50% better compression than H.264 for video resolutions of 1080p and above; 30-40% better compression than H.264 for lower resolutions
35% reduction in bitrate for the same PSNR output when compared to H.264
Perceptual video quality is subjective and cannot be measured with PSNR values
Subjective tests have shown around 50% reduction in bitrate for similar perceptual video quality when compared to H.264
HEVC Applications – Near Term
The over-the-top (OTT) video services market is growing at a rapid pace, thanks to Netflix, Hulu, YouTube, etc.
Smarter phones and tablets contribute significantly to OTT growth, with consumers opting to view videos on the go
OTT video services are also popular within TVs and set-top boxes
Rapid growth in the OTT market chokes the network bandwidth
One in five consumers abandons viewing due to slow feeds and a poor-quality viewing experience
HEVC will enable a superior viewing experience with OTT video services
HEVC Applications – Long Term
Higher quality video in traditional terrestrial and satellite broadcasts
Video recording in cameras and mobile phones, for saving storage space or for higher quality
Broadcasting 1080p video at 50 or 60 frames per second in the same bandwidth as 1080i (25 or 30 fps)
4K and 8K Ultra-HD broadcasts for theatre-like quality
Need for Software HEVC Decoder
HEVC is a newly ratified standard and there is no hardware support in the current generation of processors (Embedded / Mobile / Applications SoCs)
Dedicated HW accelerators for HEVC increase the silicon area and hence the cost significantly
Lack of HEVC content makes an early HW implementation risky
Software decoding is a simpler and economically viable option for HEVC deployment NOW
Handling the HEVC decoder complexity on a wide range of processors with constraints on power consumption is the key challenge for the software decoder
Why use GPUs for Video Processing?
Decoding high-resolution video in software involves high computational complexity and loads the CPU heavily
GPUs are highly compute-capable and power-efficient devices
GPUs are generally idle during video playout
GPU acceleration will free up the CPU to perform other (system) tasks
[Figure: CPU core(s) (ARM Cortex with NEON) alongside a Mali-T600 / OpenCL-compliant GPU]
HEVC Decoding on Capable GPUs
GPUs are massively multithreaded devices capable of handling hundreds or thousands of threads in parallel at any given time
Only the highly data-parallel algorithms of a video codec can be efficiently offloaded to the GPU for processing
[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Inverse Quant, Intra Prediction, Inverse Transform, Recon, Deblocking & SAO; legend: "Not suitable for GPU execution" and "Data parallel execution, suitable for GPU execution"]
Motion Compensation
The pixels of the current picture/frame are predicted from the pixels of a reference frame
The reference picture can be from the past or the future
The prediction happens on a block-by-block basis
There can be multiple reference frames for each block
Motion Compensation
The most compute-intensive part of motion compensation is sub-pixel interpolation
• Luma – 8 or 7 tap filter
• Chroma – 4 tap filter
Sub-pixel interpolation is data parallel, i.e., interpolation of each block within a frame can happen in parallel, and hence it is suited for GPU computing
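The luma sub-pixel interpolation described above can be sketched in a few lines. This is a minimal single-pass illustration in Python, not the decoder's actual GPU kernel: it applies the 8-tap (half-pel) and 7-tap (quarter-pel) HEVC luma filter taps along one row. The 8-bit samples, the edge padding, and the helper name `interpolate_row` are assumptions made for the example. Each output sample depends only on the input row, which is what makes the operation data parallel.

```python
# HEVC luma interpolation filter taps; each set sums to 64.
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)    # 8 taps
QUARTER_PEL_TAPS = (-1, 4, -10, 58, 17, -5, 1, 0)   # 7 taps, padded with a zero


def interpolate_row(row, taps, bit_depth=8):
    """Horizontally interpolate one row of luma samples (hypothetical helper).

    Output sample i lies between row[i] and row[i + 1]. Positions outside
    the row reuse the nearest edge sample, mimicking reference padding.
    """
    max_val = (1 << bit_depth) - 1
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0
        for k, tap in enumerate(taps):
            pos = min(max(i + k - 3, 0), last)    # taps cover i-3 .. i+4
            acc += tap * row[pos]
        value = (acc + 32) >> 6                   # round, then divide by 64
        out.append(min(max(value, 0), max_val))   # clip to the sample range
    return out
```

For example, `interpolate_row([0, 0, 0, 0, 255, 255, 255, 255], HALF_PEL_TAPS)[3]` evaluates to 128, the midpoint across the edge. Since no output sample depends on another output sample, a GPU implementation could assign one work-item per sample or per block.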
Inverse Quantization and Transform
Inverse Quantization & Transform
• The residual values need to be inverse quantized
• 2-D inverse DCT transformations are then performed on the inverse-quantized data
Recon & In-Loop Filters
• Reconstruction: the output of motion compensation and intra prediction is added to the output of the inverse transform
• In-loop filters such as deblocking and SAO are applied to the reconstructed samples
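The inverse quantization, inverse transform and reconstruction steps listed above can be illustrated with a small Python sketch for one 4x4 block. It is a simplified model rather than the decoder's implementation. The function names are hypothetical. `dequantize` uses a plain uniform `step` instead of the QP-dependent scaling of the standard. The inverse transform uses the HEVC 4x4 integer core matrix with the two-stage shifts for 8-bit video, and the intermediate 16-bit clipping is omitted.

```python
# 4x4 integer core transform matrix used by HEVC (rows are basis vectors).
M4 = (
    (64, 64, 64, 64),
    (83, 36, -36, -83),
    (64, -64, -64, 64),
    (36, -83, 83, -36),
)


def dequantize(levels, step):
    """Simplified uniform inverse quantization (assumption, not the exact
    QP-dependent scaling of the standard): scale each level by `step`."""
    return [[level * step for level in row] for row in levels]


def inverse_transform_4x4(coeffs, bit_depth=8):
    """Two-stage separable inverse transform: columns first, then rows."""
    shift2 = 20 - bit_depth
    # Stage 1: vertical pass, tmp = M4^T * coeffs, rounded and shifted by 7.
    tmp = [[(sum(M4[k][i] * coeffs[k][j] for k in range(4)) + 64) >> 7
            for j in range(4)] for i in range(4)]
    # Stage 2: horizontal pass, residual = tmp * M4, rounded and shifted.
    rnd = 1 << (shift2 - 1)
    return [[(sum(tmp[i][k] * M4[k][j] for k in range(4)) + rnd) >> shift2
             for j in range(4)] for i in range(4)]


def reconstruct(prediction, residual, bit_depth=8):
    """Add the residual to the prediction and clip to the sample range."""
    max_val = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]
```

A block that carries only a DC level of 4, dequantized with `step=16`, yields a flat residual of 1. Adding it to a flat prediction of 100 gives 101 in every position. Every block can be processed independently of the others, which fits the data-parallel model described earlier.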
[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Inverse Quant, Intra Prediction, Inverse Transform, Recon, Deblocking & SAO]
Challenges in CPU+GPU Implementation
Efficient partitioning of work between CPU and GPU
• The effective FPS of the decoder will be the minimum of the FPS achieved by the CPU and by the GPU for their respective work
• The partitioning therefore needs to be efficient, so that both perform their respective work at almost the same speed (FPS)
Efficient pipelining of data between CPU and GPU
• The algorithms running on the CPU will depend on the output of the algorithms running on the GPU, and/or vice versa
• A good design should make sure that neither the CPU nor the GPU spends any time waiting for the output of the other
Cache coherency
• Cache coherency between CPU and GPU data needs to be ensured.
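The partitioning and pipelining points above can be made concrete with a small throughput model in Python. It is only a back-of-the-envelope sketch. The function names and the example rates are hypothetical. The serialized formula is a standard assumption added here for contrast, not a figure from the slides.

```python
def pipelined_fps(cpu_fps, gpu_fps):
    """Steady-state rate when CPU and GPU work on different frames at the
    same time: the slower of the two stages sets the pace."""
    return min(cpu_fps, gpu_fps)


def serialized_fps(cpu_fps, gpu_fps):
    """Rate when each processor waits for the other on every frame, so the
    per-frame times of the two stages add up."""
    return 1.0 / (1.0 / cpu_fps + 1.0 / gpu_fps)
```

With a CPU share that runs at 40 FPS and a GPU share that runs at 60 FPS, the pipelined decoder is limited to 40 FPS, while a design in which the two wait on each other drops to about 24 FPS. Moving work from the CPU to the GPU until both rates are nearly equal raises the minimum.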
Benefits of Mali-T600 GPU
128-bit vector processing
• Suits DSP algorithms like video processing
Presence of GPU cache instead of local memory
• No requirement for data transfers from/to global memory. Can be reasoned about just like a CPU.
Flexible OpenCL workgroup size
• Works optimally for a large range of OpenCL workgroup sizes. Multiple block sizes in a video frame can be handled efficiently.
No divergent threads
• Similar to CPU code, conditional code can be used in OpenCL kernels as well. Different filter types, filter lengths, etc. in video decode can be handled efficiently.
Unified memory
• CPU and GPU share the same memory. Video YUV buffers are large, and there is no need for costly memory transfers of those buffers.
Mali GPUs are well suited for video acceleration, with significant power/performance benefits
Thank You
For more information visit www.ittiam.com
or contact us at mkt@ittiam.com