Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

simulator. The simulator is compatible to 3GPP specification 

defined in 36.211 and 36.212 [2]. The 3GPP SCME model 

[10] is used as the channel model. In the simulation, 5000 

subframes have been simulated. 64-QAM modulation and 5/6 

coding rate is used. Soft-output sphere decoder which 

achieves ML performance is used as the detection method. No 

close-loop precoding is assumed. At most three 

retransmissions are allowed in the CC based H-ARQ. 

Simulation results are depicted in Fig. 6 and Fig. 7. Note that 

in the simulation, only 5MHz band is used instead of 20MHz 

band for the maximum throughput. The link-level BER and 

throughput figures show that the performance degradation 

introduced by the approximations (e.g. the NII based 

approximation) adopted in the parallel decoder is very small. 

TABLE II 

COMPARISON OF ASIC IMPLEMENTATION RESULTS 

Turbo Decoder This Work [4] 

Technology CMOS 65nm CMOS 130nm 

Supply Voltage 1.2V 1.2V 

Logic Size 70kgate 44.1kgate 

Total Memory 250kb 120kb 

Total Area 0.7mm 2 1.2mm 2 

Working Frequency 250MHz 246MHz 

Throughput 152Mbps 10.8Mbps 

Estimated Power 650mW 32.4mW 

Silicon Area/Throughput 0.005 mm 2 /Mbps 0.11 mm 2 /Mbps 

Power Efficiency 0.71nJ/b/iter 0.54nJ/b/iter 

IV. ASIC IMPLEMENTATION 

The Turbo decoder is implemented using 65nm CMOS 

process from STMicroelectronics. After placement and 

routing, it easily runs at 250MHz with a core area of 0.7mm 2 . 

The throughput of the Radix-2 parallel decoder can be 

computed as: 

Sblk,max 

× fclk 

T = 

(3) 

S + S + C × × N 

( ) 

pw,max sw extra 2 ite 

Sblk 

,max 

Nsiso 

blk 

where S pw,max 

= . Here S 

,max 

= 6144 is the maximum 

number of bits per codeword defined in [2], N siso = 8 is the 

number of SISO units, S sw = 64 is the sliding window size, 

C extra = 10 is the number of overhead cycles, fclk 

= 250MHz 

is 

the clock frequency and N ite is the number of decoding 

iterations. The throughput of the decoder is 152Mbit/s in case 

N ite = 6 . In high SNR region, early stopping can effectively 

reduce the number of iterations needed to four. This gives a 

throughput of 228Mbit/s. Theoretically, for 20MHz 

bandwidth with 2× 2 Spatial Multiplexing and 64-QAM, the 

required throughput of Turbo decoding is 176Mbit/s, which 

means the presented implementation is sufficient for CAT4 

PDSCH decoding. Owing to the scalability of the on-chip 

network in [3], the Turbo decoder can be easily wrapped and 

integrated with the programmable platform. TABLE II depicts 

the ASIC implementation result in comparison to the prior-art 

presented in [4]. The result shows that work in this paper 

achieves more than 10 times data throughput compared to [4] 

while achieving higher silicon efficiency and slightly lower 

power efficiency even without the voltage scaling used in [4]. 

V. CONCLUSION 

In this paper, a 650mW Turbo decoder is presented for 

CAT4 LTE terminals. The design is based on 

algorithm/architecture co-optimization targeting a good tradeoff 

between performance and throughput. The result shows 

that at feasible silicon cost, the parallel architecture 

significantly reduces the decoding latency while allowing a 

link-level performance that is close to the traditional serial 

decoder to be achieved. 

Fig. 8. 

S8 

Interleaver 

S7 

S2 

S1 

Control 

Layout of the Turbo Decoder 

S6 

S3 

VI. ACKNOWLEDGMENT 

The authors would like to thank Christian Mehlführer and 

the Christian Doppler Laboratory for Design Methodology of 

Signal Processing Algorithms at Vienna University of 

Technology, for contributions on the simulators. 

REFERENCES 

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit 

error-correcting coding and decoding: Turbo Codes,” in Proc. IEEE 

ICC, 1993, pp. 1064-1070. 

[2] 3GPP, “Tech. Spec. 36.212 V8.4.0, E-UTRA; Multiplexing and Channel 

Coding,” Sept 2008. 

[3] Nilsson, E. Tell, and D. Liu, “An 11 mm 2 70 mW Fully-Programmable 

Baseband Processor for Mobile WiMAX and DVB-T/H in 0.12µm 

CMOS,” in IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 90- 

97, 2009. 

[4] Benkeser, A. Burg, T. Cupaiuolo, Q. Huang, “Design and Optimization 

of an HSDPA Turbo Decoder ASIC,” in IEEE Journal of Solid-State 

Circuits, vol. 44, no. 1, pp. 98-106, 2009. 

[5] S.Yoon, and Y. Bar-Ness, “A Parallel MAP Algorithm for Low Latency 

Turbo Decoding,” in IEEE Communication Letters, vol 6, pp. 288-290, 

July, 2002. 

[6] M.C. Shin and I.C. Park. “SIMD processor-based Turbo decoder 

supporting multiple third-generation wireless standards,” in IEEE Trans. 

on VLSI, vol.15, pp. 801-810, June 2007. 

[7] A.Dingninou, F. Raouafi, and C. Berrou, “Organisation de la memoire 

dans un turbo decodeur utilisant l'algorithm SUB-MAP,” in Proceedings 

of Gretsi, pp. 71-74, Sept. 1999. 

[8] R. Asghar, D. Liu, “Dual standard re-configurable hardware interleaver 

for turbo decoding,” in Proceedings of IEEE ISWPC, May 2008, pp. 768 

- 772. 

[9] R. Asghar, D. Wu, J. Eilert and D. Liu, “Memory Conflict Analysis and 

Implementation of a Re-configurable Interleaver Architecture 

Supporting Unified Parallel Turbo Decoding,” in Journal of Signal 

Processing Systems; DOI: 10.1007/s11265-009-0394-8. To Appear. 

[10] D. S. Baum, J. Salo, M. Milojevic, P. Kyöti, and J. Hansen “MATLAB 

implementation of the interim channel model for beyond-3G systems 

(SCME),” May 2005. 

S5 

S4

Previous page

Next page

1

2

3

4

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Create successful ePaper yourself

Delete template?

Save as template?