01.02.2015 Views

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

simulator. The simulator is compatible to <strong>3GPP</strong> specification<br />

defined in 36.211 and 36.212 [2]. The <strong>3GPP</strong> SCME model<br />

[10] is used as the channel model. In the simulation, 5000<br />

subframes have been simulated. 64-QAM modulation and 5/6<br />

coding rate is used. S<strong>of</strong>t-output sphere decoder which<br />

achieves ML per<strong>for</strong>mance is used as the detection method. No<br />

close-loop precoding is assumed. At most three<br />

retransmissions are allowed in the CC based H-ARQ.<br />

Simulation results are depicted in Fig. 6 and Fig. 7. Note that<br />

in the simulation, only 5MHz band is used instead <strong>of</strong> 20MHz<br />

band <strong>for</strong> the maximum throughput. The link-level BER and<br />

throughput figures show that the per<strong>for</strong>mance degradation<br />

introduced by the approximations (e.g. the NII based<br />

approximation) adopted in the parallel decoder is very small.<br />

TABLE II<br />

COMPARISON OF ASIC IMPLEMENTATION RESULTS<br />

<strong>Turbo</strong> <strong>Decoder</strong> This Work [4]<br />

Technology CMOS 65nm CMOS 130nm<br />

Supply Voltage 1.2V 1.2V<br />

Logic Size 70kgate 44.1kgate<br />

Total Memory 250kb 120kb<br />

Total Area 0.7mm 2 1.2mm 2<br />

Working Frequency 250MHz 246MHz<br />

Throughput 152Mbps 10.8Mbps<br />

Estimated Power 650mW 32.4mW<br />

Silicon Area/Throughput 0.005 mm 2 /Mbps 0.11 mm 2 /Mbps<br />

Power Efficiency 0.71nJ/b/iter 0.54nJ/b/iter<br />

IV. ASIC IMPLEMENTATION<br />

The <strong>Turbo</strong> decoder is implemented using 65nm CMOS<br />

process from STMicroelectronics. After placement and<br />

routing, it easily runs at 250MHz with a core area <strong>of</strong> 0.7mm 2 .<br />

The throughput <strong>of</strong> the Radix-2 parallel decoder can be<br />

computed as:<br />

Sblk,max<br />

× fclk<br />

T =<br />

(3)<br />

S + S + C × × N<br />

( )<br />

pw,max sw extra 2 ite<br />

Sblk<br />

,max<br />

Nsiso<br />

blk<br />

where S pw,max<br />

= . Here S<br />

,max<br />

= 6144 is the maximum<br />

number <strong>of</strong> bits per codeword defined in [2], N siso = 8 is the<br />

number <strong>of</strong> SISO units, S sw = 64 is the sliding window size,<br />

C extra = 10 is the number <strong>of</strong> overhead cycles, fclk<br />

= 250MHz<br />

is<br />

the clock frequency and N ite is the number <strong>of</strong> decoding<br />

iterations. The throughput <strong>of</strong> the decoder is 152Mbit/s in case<br />

N ite = 6 . In high SNR region, early stopping can effectively<br />

reduce the number <strong>of</strong> iterations needed to four. This gives a<br />

throughput <strong>of</strong> 228Mbit/s. Theoretically, <strong>for</strong> 20MHz<br />

bandwidth with 2× 2 Spatial Multiplexing and 64-QAM, the<br />

required throughput <strong>of</strong> <strong>Turbo</strong> decoding is 176Mbit/s, which<br />

means the presented implementation is sufficient <strong>for</strong> CAT4<br />

PDSCH decoding. Owing to the scalability <strong>of</strong> the on-chip<br />

network in [3], the <strong>Turbo</strong> decoder can be easily wrapped and<br />

integrated with the programmable plat<strong>for</strong>m. TABLE II depicts<br />

the ASIC implementation result in comparison to the prior-art<br />

presented in [4]. The result shows that work in this paper<br />

achieves more than 10 times data throughput compared to [4]<br />

while achieving higher silicon efficiency and slightly lower<br />

power efficiency even without the voltage scaling used in [4].<br />

V. CONCLUSION<br />

In this paper, a 650mW <strong>Turbo</strong> decoder is presented <strong>for</strong><br />

CAT4 LTE terminals. The design is based on<br />

algorithm/architecture co-optimization targeting a good trade<strong>of</strong>f<br />

between per<strong>for</strong>mance and throughput. The result shows<br />

that at feasible silicon cost, the parallel architecture<br />

significantly reduces the decoding latency while allowing a<br />

link-level per<strong>for</strong>mance that is close to the traditional serial<br />

decoder to be achieved.<br />

Fig. 8.<br />

S8<br />

Interleaver<br />

S7<br />

S2<br />

S1<br />

Control<br />

Layout <strong>of</strong> the <strong>Turbo</strong> <strong>Decoder</strong><br />

S6<br />

S3<br />

VI. ACKNOWLEDGMENT<br />

The authors would like to thank Christian Mehlführer and<br />

the Christian Doppler Laboratory <strong>for</strong> Design Methodology <strong>of</strong><br />

Signal Processing Algorithms at Vienna University <strong>of</strong><br />

Technology, <strong>for</strong> contributions on the simulators.<br />

REFERENCES<br />

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit<br />

error-correcting coding and decoding: <strong>Turbo</strong> Codes,” in Proc. IEEE<br />

ICC, 1993, pp. 1064-1070.<br />

[2] <strong>3GPP</strong>, “Tech. Spec. 36.212 V8.4.0, E-UTRA; Multiplexing and Channel<br />

Coding,” Sept 2008.<br />

[3] Nilsson, E. Tell, and D. Liu, “An 11 mm 2 70 mW Fully-Programmable<br />

Baseband Processor <strong>for</strong> Mobile WiMAX and DVB-T/H in 0.12µm<br />

CMOS,” in IEEE Journal <strong>of</strong> Solid-State Circuits, vol. 44, no. 1, pp. 90-<br />

97, 2009.<br />

[4] Benkeser, A. Burg, T. Cupaiuolo, Q. Huang, “Design and Optimization<br />

<strong>of</strong> an HSDPA <strong>Turbo</strong> <strong>Decoder</strong> ASIC,” in IEEE Journal <strong>of</strong> Solid-State<br />

Circuits, vol. 44, no. 1, pp. 98-106, 2009.<br />

[5] S.Yoon, and Y. Bar-Ness, “A <strong>Parallel</strong> MAP Algorithm <strong>for</strong> Low Latency<br />

<strong>Turbo</strong> Decoding,” in IEEE Communication Letters, vol 6, pp. 288-290,<br />

July, 2002.<br />

[6] M.C. Shin and I.C. Park. “SIMD processor-based <strong>Turbo</strong> decoder<br />

supporting multiple third-generation wireless standards,” in IEEE Trans.<br />

on VLSI, vol.15, pp. 801-810, June 2007.<br />

[7] A.Dingninou, F. Raouafi, and C. Berrou, “Organisation de la memoire<br />

dans un turbo decodeur utilisant l'algorithm SUB-MAP,” in Proceedings<br />

<strong>of</strong> Gretsi, pp. 71-74, Sept. 1999.<br />

[8] R. Asghar, D. Liu, “Dual standard re-configurable hardware interleaver<br />

<strong>for</strong> turbo decoding,” in Proceedings <strong>of</strong> IEEE ISWPC, May 2008, pp. 768<br />

- 772.<br />

[9] R. Asghar, D. Wu, J. Eilert and D. Liu, “Memory Conflict Analysis and<br />

<strong>Implementation</strong> <strong>of</strong> a Re-configurable Interleaver Architecture<br />

Supporting Unified <strong>Parallel</strong> <strong>Turbo</strong> Decoding,” in Journal <strong>of</strong> Signal<br />

Processing Systems; DOI: 10.1007/s11265-009-0394-8. To Appear.<br />

[10] D. S. Baum, J. Salo, M. Milojevic, P. Kyöti, and J. Hansen “MATLAB<br />

implementation <strong>of</strong> the interim channel model <strong>for</strong> beyond-3G systems<br />

(SCME),” May 2005.<br />

S5<br />

S4

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!