Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...
Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...
Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
simulator. The simulator is compatible to <strong>3GPP</strong> specification<br />
defined in 36.211 and 36.212 [2]. The <strong>3GPP</strong> SCME model<br />
[10] is used as the channel model. In the simulation, 5000<br />
subframes have been simulated. 64-QAM modulation and 5/6<br />
coding rate is used. S<strong>of</strong>t-output sphere decoder which<br />
achieves ML per<strong>for</strong>mance is used as the detection method. No<br />
close-loop precoding is assumed. At most three<br />
retransmissions are allowed in the CC based H-ARQ.<br />
Simulation results are depicted in Fig. 6 and Fig. 7. Note that<br />
in the simulation, only 5MHz band is used instead <strong>of</strong> 20MHz<br />
band <strong>for</strong> the maximum throughput. The link-level BER and<br />
throughput figures show that the per<strong>for</strong>mance degradation<br />
introduced by the approximations (e.g. the NII based<br />
approximation) adopted in the parallel decoder is very small.<br />
TABLE II<br />
COMPARISON OF ASIC IMPLEMENTATION RESULTS<br />
<strong>Turbo</strong> <strong>Decoder</strong> This Work [4]<br />
Technology CMOS 65nm CMOS 130nm<br />
Supply Voltage 1.2V 1.2V<br />
Logic Size 70kgate 44.1kgate<br />
Total Memory 250kb 120kb<br />
Total Area 0.7mm 2 1.2mm 2<br />
Working Frequency 250MHz 246MHz<br />
Throughput 152Mbps 10.8Mbps<br />
Estimated Power 650mW 32.4mW<br />
Silicon Area/Throughput 0.005 mm 2 /Mbps 0.11 mm 2 /Mbps<br />
Power Efficiency 0.71nJ/b/iter 0.54nJ/b/iter<br />
IV. ASIC IMPLEMENTATION<br />
The <strong>Turbo</strong> decoder is implemented using 65nm CMOS<br />
process from STMicroelectronics. After placement and<br />
routing, it easily runs at 250MHz with a core area <strong>of</strong> 0.7mm 2 .<br />
The throughput <strong>of</strong> the Radix-2 parallel decoder can be<br />
computed as:<br />
Sblk,max<br />
× fclk<br />
T =<br />
(3)<br />
S + S + C × × N<br />
( )<br />
pw,max sw extra 2 ite<br />
Sblk<br />
,max<br />
Nsiso<br />
blk<br />
where S pw,max<br />
= . Here S<br />
,max<br />
= 6144 is the maximum<br />
number <strong>of</strong> bits per codeword defined in [2], N siso = 8 is the<br />
number <strong>of</strong> SISO units, S sw = 64 is the sliding window size,<br />
C extra = 10 is the number <strong>of</strong> overhead cycles, fclk<br />
= 250MHz<br />
is<br />
the clock frequency and N ite is the number <strong>of</strong> decoding<br />
iterations. The throughput <strong>of</strong> the decoder is 152Mbit/s in case<br />
N ite = 6 . In high SNR region, early stopping can effectively<br />
reduce the number <strong>of</strong> iterations needed to four. This gives a<br />
throughput <strong>of</strong> 228Mbit/s. Theoretically, <strong>for</strong> 20MHz<br />
bandwidth with 2× 2 Spatial Multiplexing and 64-QAM, the<br />
required throughput <strong>of</strong> <strong>Turbo</strong> decoding is 176Mbit/s, which<br />
means the presented implementation is sufficient <strong>for</strong> CAT4<br />
PDSCH decoding. Owing to the scalability <strong>of</strong> the on-chip<br />
network in [3], the <strong>Turbo</strong> decoder can be easily wrapped and<br />
integrated with the programmable plat<strong>for</strong>m. TABLE II depicts<br />
the ASIC implementation result in comparison to the prior-art<br />
presented in [4]. The result shows that work in this paper<br />
achieves more than 10 times data throughput compared to [4]<br />
while achieving higher silicon efficiency and slightly lower<br />
power efficiency even without the voltage scaling used in [4].<br />
V. CONCLUSION<br />
In this paper, a 650mW <strong>Turbo</strong> decoder is presented <strong>for</strong><br />
CAT4 LTE terminals. The design is based on<br />
algorithm/architecture co-optimization targeting a good trade<strong>of</strong>f<br />
between per<strong>for</strong>mance and throughput. The result shows<br />
that at feasible silicon cost, the parallel architecture<br />
significantly reduces the decoding latency while allowing a<br />
link-level per<strong>for</strong>mance that is close to the traditional serial<br />
decoder to be achieved.<br />
Fig. 8.<br />
S8<br />
Interleaver<br />
S7<br />
S2<br />
S1<br />
Control<br />
Layout <strong>of</strong> the <strong>Turbo</strong> <strong>Decoder</strong><br />
S6<br />
S3<br />
VI. ACKNOWLEDGMENT<br />
The authors would like to thank Christian Mehlführer and<br />
the Christian Doppler Laboratory <strong>for</strong> Design Methodology <strong>of</strong><br />
Signal Processing Algorithms at Vienna University <strong>of</strong><br />
Technology, <strong>for</strong> contributions on the simulators.<br />
REFERENCES<br />
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit<br />
error-correcting coding and decoding: <strong>Turbo</strong> Codes,” in Proc. IEEE<br />
ICC, 1993, pp. 1064-1070.<br />
[2] <strong>3GPP</strong>, “Tech. Spec. 36.212 V8.4.0, E-UTRA; Multiplexing and Channel<br />
Coding,” Sept 2008.<br />
[3] Nilsson, E. Tell, and D. Liu, “An 11 mm 2 70 mW Fully-Programmable<br />
Baseband Processor <strong>for</strong> Mobile WiMAX and DVB-T/H in 0.12µm<br />
CMOS,” in IEEE Journal <strong>of</strong> Solid-State Circuits, vol. 44, no. 1, pp. 90-<br />
97, 2009.<br />
[4] Benkeser, A. Burg, T. Cupaiuolo, Q. Huang, “Design and Optimization<br />
<strong>of</strong> an HSDPA <strong>Turbo</strong> <strong>Decoder</strong> ASIC,” in IEEE Journal <strong>of</strong> Solid-State<br />
Circuits, vol. 44, no. 1, pp. 98-106, 2009.<br />
[5] S.Yoon, and Y. Bar-Ness, “A <strong>Parallel</strong> MAP Algorithm <strong>for</strong> Low Latency<br />
<strong>Turbo</strong> Decoding,” in IEEE Communication Letters, vol 6, pp. 288-290,<br />
July, 2002.<br />
[6] M.C. Shin and I.C. Park. “SIMD processor-based <strong>Turbo</strong> decoder<br />
supporting multiple third-generation wireless standards,” in IEEE Trans.<br />
on VLSI, vol.15, pp. 801-810, June 2007.<br />
[7] A.Dingninou, F. Raouafi, and C. Berrou, “Organisation de la memoire<br />
dans un turbo decodeur utilisant l'algorithm SUB-MAP,” in Proceedings<br />
<strong>of</strong> Gretsi, pp. 71-74, Sept. 1999.<br />
[8] R. Asghar, D. Liu, “Dual standard re-configurable hardware interleaver<br />
<strong>for</strong> turbo decoding,” in Proceedings <strong>of</strong> IEEE ISWPC, May 2008, pp. 768<br />
- 772.<br />
[9] R. Asghar, D. Wu, J. Eilert and D. Liu, “Memory Conflict Analysis and<br />
<strong>Implementation</strong> <strong>of</strong> a Re-configurable Interleaver Architecture<br />
Supporting Unified <strong>Parallel</strong> <strong>Turbo</strong> Decoding,” in Journal <strong>of</strong> Signal<br />
Processing Systems; DOI: 10.1007/s11265-009-0394-8. To Appear.<br />
[10] D. S. Baum, J. Salo, M. Milojevic, P. Kyöti, and J. Hansen “MATLAB<br />
implementation <strong>of</strong> the interim channel model <strong>for</strong> beyond-3G systems<br />
(SCME),” May 2005.<br />
S5<br />
S4