Error Tolerant Multimedia Stream Processing: There's ... - IEEE Xplore

…tasks, protecting the data-memory space using error control coding (ECC) circuits is not important, as learning and recognition algorithms scale their recognition rate gracefully in the presence of data errors. However, protection of data loop-indexing and instruction-memory variables is important, as errors in the algorithm flow could cause severe distortions.

III. SYSTEM-ORIENTED OVERVIEW OF ERROR-TOLERANT MULTIMEDIA STREAM PROCESSING

Given the possibilities for resource/distortion scaling presented for the broad classes of multimedia stream processing applications of the previous section, we elaborate on the different possibilities for practical system adjustment toward resource scaling and resilience under approximations of outputs, or under processing errors stemming from the software, scheduling, architecture, or hardware layers. Unlike the previous section, which focused on particular classes of multimedia stream processing algorithms, here we summarize research advances in the system layers, from the software layer down to the architecture, device, and IC layers.

A. Advances in Approximate (Precision-Aware) Multimedia Stream Processing Software Components

In the software layer, the de facto standard libraries for high-performance MSP today are the Basic Linear Algebra Subprograms (BLAS), the Linear Algebra Package (LAPACK), and digital signal processing or computer vision libraries. Several optimized designs exist for these, focusing on single-instruction-multiple-data architectures (e.g., NAG, Goto [68], ATLAS, Intel IPP, OpenCV, Matlab, and AMD ACML), manycore platforms (e.g., Matlab Parallel Computing Toolbox, PLASMA/MAGMA [69]), and embedded processors (e.g., Eigen for ARM Neon).

All high-performance realizations of MSP algorithms try to make maximum use of the computational kernels of such libraries so that their realization remains as modular and highly optimized as possible. This modularization also allows for automated design flows that can optimally select hardware resources from a multi-component architecture for data-dominated MSP algorithm flows [17], [18], [70]. Finally, the importance of this modularization of algorithmic components into high-performance computational kernels of software libraries is demonstrated by the recent definition of the reconfigurable video coding specification [71] within the MPEG video coding standards.

We first review conventional acceleration and fault-tolerance approaches and then present some emerging ideas on throughput-distortion scaling of computation for the linear algebra operations that form the compute- and memory-intensive software kernels (primitives) of the error-tolerant MSP applications of the previous section.

1) Acceleration Techniques and Conventional Fault Tolerance: Acceleration techniques for these libraries generally follow two different approaches: tailoring and customizing the computational routines to particular systems (e.g., to exploit problem-specific sparsity in sparse linear solvers) [72], or exploiting mixed precision for accelerated processing in certain computations, e.g., for iterative linear solvers [73]. In both cases, notable accelerations have been reported via the use of manycore platforms, hybrid multicore/manycore systems, or reconfigurable systems [18]. The mixed-precision idea is sketched below.
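To make the mixed-precision route of [73] concrete, the following minimal C sketch (our illustration, not the implementation of [73]) performs the O(n^3) factorization and triangular solves in single precision and computes only the O(n^2) residuals in double precision. The 3x3 test system, the fixed count of five refinement sweeps, and the helper names lu_factor/lu_solve are assumptions for the example; pivoting is omitted, which is safe here only because the example matrix is diagonally dominant.

#include <stdio.h>

#define N 3

/* In-place LU factorization in single precision (Doolittle, no pivoting). */
static void lu_factor(float a[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];               /* multiplier stored in L */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
}

/* Forward/back substitution with the packed LU factors (unit-diagonal L). */
static void lu_solve(float a[N][N], const float b[N], float x[N]) {
    float y[N];
    for (int i = 0; i < N; i++) {
        y[i] = b[i];
        for (int j = 0; j < i; j++) y[i] -= a[i][j] * y[j];
    }
    for (int i = N - 1; i >= 0; i--) {
        x[i] = y[i];
        for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
}

int main(void) {
    /* Diagonally dominant test system, kept in double as the reference data. */
    double A[N][N] = {{4, 1, 2}, {1, 5, 1}, {2, 1, 6}};
    double b[N] = {1, 2, 3};

    /* Step 1: demote to single precision; the O(n^3) work happens there. */
    float Af[N][N], bf[N], xf[N];
    for (int i = 0; i < N; i++) {
        bf[i] = (float)b[i];
        for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];
    }
    lu_factor(Af);
    lu_solve(Af, bf, xf);

    double x[N];
    for (int i = 0; i < N; i++) x[i] = xf[i];

    /* Step 2: iterative refinement; only the O(n^2) residual uses double. */
    for (int sweep = 0; sweep < 5; sweep++) {
        float rf[N], df[N];
        for (int i = 0; i < N; i++) {
            double r = b[i];                  /* residual r = b - A*x */
            for (int j = 0; j < N; j++) r -= A[i][j] * x[j];
            rf[i] = (float)r;
        }
        lu_solve(Af, rf, df);                 /* cheap correction solve */
        for (int i = 0; i < N; i++) x[i] += df[i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %.12f\n", i, x[i]);
    return 0;
}

Because each refinement sweep reuses the single-precision factors and adds only O(n^2) double-precision work, the solution approaches double-precision accuracy while the bulk of the arithmetic runs at single-precision speed.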
State-of-the-art designs today use Streaming SIMD Extensions (SSE) for multicore processors and automated parallelization within the compilation and scheduling, e.g., via the CUDA and OpenCL frameworks.

In terms of tolerance to hardware-induced errors, conventional fault-tolerance techniques follow ECC approaches in software [74], [75] or hardware [55], [76]. These approaches create redundant computations composed of mixtures of the existing input data in a linear operation (e.g., matrix product or convolution) and form checkpoints from the redundant computations in order to check the useful results for correctness using parity bits or Hamming codes. They generally incur a 70–150% performance penalty in terms of the achieved giga floating-point operations per second (GFLOPS) [75], [76], and they can only detect and correct a limited number of errors per computation, typically one or two errors per input substream of data. Complementary to such general-purpose fault-tolerant approaches, error tolerance in software or hardware for MSP was proposed by Breuer, Ortega et al. [77]–[79] by exploiting the fact that multimedia operations such as motion estimation and filtering tend to create localized errors that can be masked from the user if detected or treated properly. As an example, Chung and Ortega [78] analyzed the motion estimation of an MPEG encoder based on an RTL implementation of the algorithm and concluded that most hardware-induced faults either did not create an error in the output or only led to bandwidth increases due to erroneous motion-vector estimation or redundant frame transmissions.

While these approaches for acceleration and error tolerance are certainly encouraging, they still fail to capture the massive parallelism offered by the hardware platforms themselves or, for mixed-precision methods, the speed-vs.-accuracy trade-off between single- and double-precision floating point. In fact, it is becoming all the more evident that massively parallel architectures today are significantly underutilized [80]. To quantify this, one can assess the efficiency of a particular computation via the peak-performance figure of merit. Peak performance can be defined as the ratio of the practically obtained processing throughput to the maximum possible processing throughput for a particular SP algorithm on a given platform. While well-known linear algebra libraries today achieve over 0.9 (or 90%) peak performance on single cores [68], this figure is significantly lower for multicore and manycore processors. For example, the latest performance figures for manycore processors from NVIDIA indicate a peak performance of 0.52 (52%) or less for basic matrix multiplication routines [80]. This means that (at least) 48% of the available processing-cycle budget is wasted on data I/O between cores and memory, cache misses, and stall cycles. This substantial drop in peak performance illustrates the difficulty of exploiting multicore and manycore processors to their limit.

2) Throughput-Distortion Computation via Adaptive Companding and Packing: Ideas have recently begun to emerge on precision-adaptive MSP with respect to throughput-distortion scaling of linear algebra and signal processing operations [6], [7], [81]–[85]. Fig. 4 illustrates how such approaches fit within the system overview of MSP applications of Fig. 1. The key principle of such frameworks comprises two steps: adaptive companding and packing of several (approximated) input samples into each machine word, followed by processing of the packed data with conventional kernels and unpacking of the approximate results. A simplified illustration of the packing step is given below.
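As a simplified, integer-only illustration of the packing step (not the companding-based floating-point schemes of [6], [7], [81]–[85]), the C sketch below packs two 8-bit samples into one 32-bit word so that a single hardware multiplication by a common coefficient returns both products at once, i.e., two useful operations per executed operation. The sample values and the carry-free bound asserted below are assumptions chosen so the two partial products cannot interfere.

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint32_t a0 = 37, a1 = 211;  /* two non-negative input samples (< 256)  */
    uint32_t c = 93;             /* common coefficient (e.g., a filter tap) */

    /* Carry-free condition: each partial product must fit in 16 bits,
       otherwise the low product would spill into the high half-word. */
    assert(a0 * c < 65536 && a1 * c < 65536);

    uint32_t packed = a0 | (a1 << 16);  /* pack: one word, two samples        */
    uint32_t prod = packed * c;         /* one multiply -> two useful results */

    uint32_t p0 = prod & 0xFFFFu;       /* unpack low half:  a0 * c */
    uint32_t p1 = prod >> 16;           /* unpack high half: a1 * c */

    printf("a0*c = %" PRIu32 " (expected %" PRIu32 ")\n", p0, a0 * c);
    printf("a1*c = %" PRIu32 " (expected %" PRIu32 ")\n", p1, a1 * c);
    return 0;
}

Companding is what makes such packing scalable: quantizing the inputs more coarsely shrinks their dynamic range, so more samples fit into each word and throughput rises, at the cost of additional distortion in the unpacked outputs. This is precisely the throughput-distortion trade-off these frameworks expose.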
