01.02.2015 Views

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Implementation</strong> <strong>of</strong> A <strong>High</strong>-<strong>Speed</strong> <strong>Parallel</strong><br />

<strong>Turbo</strong> <strong>Decoder</strong> <strong>for</strong> <strong>3GPP</strong> LTE Terminals<br />

Di Wu, Rizwan Asghar, Yulin Huang and Dake Liu<br />

Abstract — This paper presents a parameterized parallel<br />

<strong>Turbo</strong> decoder <strong>for</strong> <strong>3GPP</strong> LTE terminals. To support the high<br />

peak data-rate defined in the <strong>for</strong>thcoming <strong>3GPP</strong> LTE<br />

standard, <strong>Turbo</strong> decoder with a throughout beyond 150Mbit/s<br />

is needed as a key component <strong>of</strong> the radio baseband chip. By<br />

exploiting the trade<strong>of</strong>f <strong>of</strong> precision, speed and area<br />

consumption, a <strong>Turbo</strong> decoder with eight parallel SISO units<br />

is implemented to meet the throughput requirement. The turbo<br />

decoder is synthesized, placed and routed using 65nm CMOS<br />

technology. It achieves a throughput <strong>of</strong> 152Mbit/s and<br />

occupies an area <strong>of</strong> 0.7mm 2 with estimated power<br />

consumption being 650mW 1 .<br />

Index Terms — <strong>Turbo</strong>, <strong>3GPP</strong>, ASIC.<br />

I. INTRODUCTION<br />

The ever increasing need <strong>of</strong> high-speed mobile data service<br />

has fostered the deployment <strong>of</strong> various mobile broadband<br />

technologies such as 3rd Generation Partnership Project Long-<br />

Term Evolution (<strong>3GPP</strong>-LTE). In order to provide mobility as<br />

well as reliability <strong>of</strong> the data transmission over error-prone<br />

radio channels, <strong>for</strong>ward-error correction (FEC) codes, which<br />

add redundant in<strong>for</strong>mation be<strong>for</strong>e transmitting the data, are<br />

indispensable to achieve robust data transmission. Among<br />

them, <strong>Turbo</strong> codes [1] have been widely adopted by different<br />

standards owing to its capability <strong>of</strong> approaching the Shannon<br />

limit. Meanwhile, since the amount <strong>of</strong> operations involved in<br />

<strong>Turbo</strong> decoding is significant, it has always been the<br />

bottleneck <strong>of</strong> these wireless systems which are latency<br />

constrained.<br />

Hybrid Automatic Repeat ReQuest (H-ARQ) [2] has been<br />

introduced in LTE to improve the throughput by<br />

retransmitting corrupted packets and applying s<strong>of</strong>t-combining<br />

at the receiver side. With the aid <strong>of</strong> H-ARQ, link-level<br />

throughput can be significantly improved even in presence <strong>of</strong><br />

mild bit-error-rate (BER). There<strong>for</strong>e, when benchmarking the<br />

per<strong>for</strong>mance <strong>of</strong> <strong>Turbo</strong> decoding <strong>for</strong> design trade-<strong>of</strong>f<br />

(per<strong>for</strong>mance vs. complexity), it is not enough to only<br />

consider a plain AWGN channel without standard compliant<br />

parameters. Especially H-ARQ needs to be taken into<br />

consideration at the same time.<br />

1 The work is supported by European Commission through the EU-FP7<br />

Multi-base project with Ericsson AB, Infineon Austria AG, IMEC, Lund<br />

University and KU-Leuven.<br />

D. Wu, R. Asghar, and D. Liu are with Department <strong>of</strong> Electrical<br />

Engineering, Linköping University, 58183 Linköping, Sweden (e-mail:<br />

rizwan@isy.liu.se, diwu@isy.liu.se, dake@isy.liu.se)<br />

Y. Huang was with Department <strong>of</strong> Electrical Engineering, Linköping<br />

University, 58183 Linköping, Sweden.<br />

TABLE I<br />

SUPPORTED LTE UE CATEGORIES<br />

UE Category (CAT) 1 2 3 4<br />

Supported Bandwidth<br />

1,4,3,5,10,15,20 MHz<br />

Antenna Configurations<br />

Up to 2x2 SM and SFBC<br />

Num <strong>of</strong> Layers <strong>for</strong> SM 1 2<br />

Max num <strong>of</strong> S<strong>of</strong>t-bits 250368 1237248 1237248 1827072<br />

DL peak rate (Mbit/s) 10 50 100 150<br />

UL peak rate (Mbit/s) 5 25 50 50<br />

In [3] a programmable baseband processor is presented to<br />

accommodate the symbol processing <strong>of</strong> various emerging<br />

wireless standards (e.g. LTE and WiMAX) by simply<br />

reloading the firmware. However, computational intensive<br />

parts such as the <strong>Turbo</strong> decoder need to be implemented as<br />

ASIC <strong>for</strong> the sake <strong>of</strong> silicon and power efficiency. In this<br />

paper, a parallel <strong>Turbo</strong> decoder is implemented as an<br />

accelerator attached to the baseband processor to meet the<br />

physical layer requirements <strong>of</strong> up to Category 4 user<br />

equipment (UE) as listed in TABLE I. Standard compliant<br />

link-level simulation is carried out to verify the per<strong>for</strong>mance<br />

<strong>of</strong> the decoder with H-ARQ taken into account. The remainder<br />

<strong>of</strong> the paper is organized as follows. In Section II the<br />

architecture <strong>of</strong> the decoder is presented. Section III presents<br />

the simulation per<strong>for</strong>mance <strong>of</strong> the <strong>Turbo</strong> decoder. The ASIC<br />

implementation is presented in Section IV. Finally, Section V<br />

concludes the paper.<br />

II. PARALLEL TURBO DECODER ARCHITECTURE<br />

According to the definition <strong>of</strong> LTE parameters in TABLE I,<br />

up to 150Mbit/s peak data rate <strong>for</strong> downlink needs to be<br />

supported. With the state-<strong>of</strong>-the-art <strong>Turbo</strong> decoder (e.g. [4])<br />

that contains a single s<strong>of</strong>t-input s<strong>of</strong>t-output (SISO), the<br />

maximum throughput that can be achieved with a clock<br />

frequency <strong>of</strong> 300MHz is around 25Mbit/s. Hence a new<br />

parallel <strong>Turbo</strong> decoder is needed to supply higher data rates.<br />

The way to achieve high data-rate <strong>Turbo</strong> decoding is to<br />

parallelize the decoder efficiently while maintaining sufficient<br />

decoding per<strong>for</strong>mance. However, the existing WCDMA<br />

interleaver will incur memory conflicts when several SISO<br />

units are in parallel accessing different parts <strong>of</strong> the coded<br />

block which prohibits the straight<strong>for</strong>ward parallelization.<br />

Fortunately, a contention-free interleaver has been introduced<br />

in LTE [2] which allows multiple SISO to access different<br />

sub-blocks in parallel. This brings the opportunity <strong>of</strong> lowcomplexity<br />

parallel implementation <strong>of</strong> the decoder.<br />

Although <strong>Turbo</strong> decoders based on SIMD architectures [6]<br />

exist, they are not competitive compared to dedicated


implementations. Furthermore, since the variation <strong>of</strong> <strong>Turbo</strong><br />

decoding procedure among different standards is small, a<br />

parameterized <strong>Turbo</strong> decoding accelerator is sufficient to<br />

achieve enough flexibility while maintaining area and power<br />

efficiency. In order to supply 150Mbit/s peak data rate<br />

required by LTE UE category 4, a parallel <strong>Turbo</strong> decoder with<br />

eight SISO units has been designed with its architecture<br />

depicted in Fig. 1.<br />

values at the beginning and end <strong>of</strong> sliding windows and<br />

parallel windows need to be saved which means the overhead<br />

is limited.<br />

A. SISO Unit<br />

SISO unit is the major block in a <strong>Turbo</strong> decoder. The input to<br />

s<br />

the SISO are: systematic a-prior in<strong>for</strong>mation λ a k , systematic<br />

intrinsic in<strong>for</strong>mation λ i s k and parity intrinsic in<strong>for</strong>mation λ i p<br />

k .<br />

In each SISO, classical sliding-window processing is applied<br />

to reduce the latency. In this paper, the size <strong>of</strong> the sliding<br />

window S sw is 64. Each SISO unit consists <strong>of</strong> α , β , γ and Log-<br />

Likelihood Ratio (LLR) units. Radix-2 log-Max decoding is<br />

used with the scaling <strong>of</strong> the extrinsic in<strong>for</strong>mation which<br />

allows a close log-MAP decoding per<strong>for</strong>mance to be<br />

achieved. The scaling factor ranges from 0.6 to 0.8.<br />

Fig. 1.<br />

Block Diagram <strong>of</strong> the <strong>Parallel</strong> <strong>Turbo</strong> <strong>Decoder</strong><br />

Fig. 3.<br />

Block Diagram <strong>of</strong> the SISO Unit<br />

Fig. 2.<br />

Scheduling Chart <strong>of</strong> a Half Iteration<br />

As defined in [2], the maximum code block size is 6144 in<br />

LTE which will incur very high latency when serial <strong>Turbo</strong><br />

decoder is used. A super windowing strategy [5] is adopted in<br />

this paper to speed up the decoding. First, the coded block will<br />

be partitioned into eight equally sized sub-blocks. Each SISO<br />

unit is responsible <strong>for</strong> the processing <strong>of</strong> one sub-block. The<br />

processing <strong>of</strong> one sub-block is considered as one parallel<br />

window with size S pw<br />

. Hence in total eight parallel windows<br />

can be processed in parallel, which reduced the processing<br />

time by eight times. The challenge when using parallel<br />

window is the absence <strong>of</strong> initial values at the border <strong>of</strong> each<br />

parallel window. As depicted in Fig. 1 and Fig. 3, a method<br />

namely Next Iteration Initialization (NII) [7] is used to<br />

estimate the initial values both sliding windows and parallel<br />

windows using the corresponding state metrics produced at<br />

previous iteration. This requires extra storage <strong>of</strong> α and β<br />

values. These values are saved in a first-in first-out (FIFO)<br />

between neighboring SISO units and the local buffer inside<br />

each SISO. There<strong>for</strong>e, the reduction <strong>of</strong> computational latency<br />

is at the cost <strong>of</strong> increased storage. Fortunately, only α and β<br />

• γ Unit: As depicted in Fig. 4(a), the γ units computes<br />

two transition metrics<br />

s s s s p<br />

γ10 = λak + λik , γ11<br />

= λak + λik + λik<br />

p<br />

(with γ 00 = 0 and γ 01 = λi k ) from the input.<br />

• α unit: α unit computes the <strong>for</strong>ward state metrics<br />

1<br />

( , ) , ( , )<br />

α m max<br />

b j m j b j m<br />

k = αk<br />

− 1<br />

γ k − 1<br />

j=<br />

0<br />

<strong>for</strong> each sub-block assigned to the SISO with size S pw<br />

. The<br />

result will be saved into a stack memory, from where, the β<br />

unit will read in a reversed order to compute the backward<br />

metrics. The α unit consists <strong>of</strong> mainly eight add-compareselect<br />

(ACS) units as depicted in Fig. 4(b).<br />

Fig. 4.<br />

γ and ACS Units<br />

• β unit: Similar to α unit, β unit also consists <strong>of</strong> eight<br />

ACS units. It computes the backward metrics<br />

1<br />

m j, m f ( j, m)<br />

k = max k k 1<br />

j=<br />

0<br />

β γ β +<br />

based on the result computed early by the γ unit. Here<br />

f ( jm , ) is the next state in the trellis when the input is j<br />

and current state being m. As NII method is used, the<br />

computed β values at the border <strong>of</strong> the sliding windows<br />

will be stored in a buffer to be used as initial value in the


jm f jm<br />

next iteration. Intermediate values γ , ( , )<br />

k β k + 1 which are<br />

computed half-way to get β m k will be passed directly to<br />

the LLR unit to compute the extrinsic in<strong>for</strong>mation Λ ( d k ).<br />

• LLR unit: It computes the extrinsic in<strong>for</strong>mation<br />

( m 1, m f (1, m ) 0, (0, )<br />

) ( m m f m )<br />

7 7<br />

Λ ( d ) = max α + γ + β − max α + γ + β<br />

Fig. 5.<br />

k k k k+ 1 k k k+<br />

1<br />

m= 0 m=<br />

0<br />

and the s<strong>of</strong>t-output based on the result from α unit and β<br />

unit. As the output <strong>of</strong> the SISO units, Λ ( d k ) is passed<br />

through the switch network to different memory banks <strong>for</strong><br />

later access by the SISO unit in the next iteration.<br />

0<br />

P1<br />

A1<br />

+<br />

–<br />

S1<br />

P2<br />

A2<br />

msb<br />

1<br />

0<br />

S2<br />

f2 >>1<br />

P3<br />

A3<br />

K<br />

AM<br />

+<br />

S3<br />

P4<br />

A4<br />

g(0)<br />

g_c<br />

1<br />

0<br />

R<br />

g_x_a<br />

S4<br />

P5<br />

<strong>Parallel</strong> Interleaver Architecture<br />

A5<br />

g_x_b<br />

g_x<br />

S5<br />

P6<br />

A6<br />

K<br />

AM<br />

+<br />

S6<br />

P7<br />

A7<br />

S<br />

R<br />

S7<br />

A<br />

P8<br />

A8<br />

achieved with the replication <strong>of</strong> the hardware <strong>for</strong> basic<br />

interleaver and getting support from a LUT providing the<br />

starting values. In this case a total <strong>of</strong> 32 additions are needed.<br />

However, <strong>for</strong> the sake <strong>of</strong> hardware reuse, part <strong>of</strong> the basic<br />

interleaver unit can be shared to generate multiple addresses at<br />

the same time. The optimized hardware uses 18 additions in<br />

total to generate 8 parallel interleaver addresses, thus saving<br />

14 additions. The hardware <strong>for</strong> parallel interleaver address<br />

generation <strong>for</strong> LTE is shown in Fig. 5.<br />

C. H-ARQ<br />

In LTE, two s<strong>of</strong>t-combining methods namely Chase<br />

Combining (CC) and Incremental Redundancy (IR) are<br />

supported. In this paper, only CC is considered. The base<br />

station retransmits the same packet when it receives the<br />

NACK signal from UE, and UE combines the LLR<br />

in<strong>for</strong>mation generated by the detector with those <strong>of</strong> the<br />

initially received packet. The gain <strong>of</strong> H-ARQ is at the cost <strong>of</strong><br />

larger s<strong>of</strong>t buffer used to store the demodulated s<strong>of</strong>t<br />

in<strong>for</strong>mation. In order to support the buffer size as large as<br />

1.8Mbits, high-density memory such as 1T-SRAM needs to be<br />

used. Since 1T-SRAM is different from the technology used<br />

<strong>for</strong> the decoder, it is not included in the results <strong>of</strong> this paper.<br />

B. <strong>Parallel</strong> Interleaver<br />

The internal interleaver involved with the <strong>Turbo</strong> Code in<br />

<strong>3GPP</strong>-LTE is based on quadratic permutation polynomial<br />

(QPP). Thanks to the QPP interleaver [2] which allows<br />

conflict-free parallel access, the implementation <strong>of</strong> the parallel<br />

interleaver is straight-<strong>for</strong>ward. In this paper, to support the<br />

parallel <strong>Turbo</strong> decoder, a configurable parallel interleaver is<br />

adopted. It can generate eight addresses in parallel <strong>for</strong> the data<br />

access <strong>of</strong> eight SISO units. QPP interleavers have very<br />

compact representation methodology and also inhibit a<br />

structure that allows the easy analysis <strong>for</strong> its properties. The<br />

interleaving function <strong>for</strong> turbo code is specified by the<br />

following quadratic permutation polynomial:<br />

2<br />

( )<br />

A( x) = f1ix+<br />

f2 i x % K<br />

(1)<br />

Here x = 0,1, 2, …( K − 1) , where K is the block size. This<br />

polynomial provides deterministic interleaver behavior <strong>for</strong><br />

different block sizes with appropriate values <strong>of</strong> f 1<br />

and f<br />

2<br />

.<br />

Direct implementation <strong>of</strong> the permutation polynomial given in<br />

eq. (1) is in-efficient due to multiplications, modulo function<br />

and bit growth problem. The simplified hardware solution is<br />

to use recursive approach adopted in [8] and [9]. Eq. (1) can<br />

be re-written <strong>for</strong> recursive computation as:<br />

( A )<br />

A( x+ 1) = ( x) + g( x) % K<br />

(2)<br />

Where g( x) = ( f1+ f2+ 2. f2. x)<br />

% K , which can also be<br />

computed recursively as g( x+ 1)<br />

= ( g( x) + 2. f2)<br />

% K . The two<br />

recursive terms I ( x+<br />

1)<br />

and g<br />

( x + 1)<br />

are easy to compute and they<br />

provide the basic interleaver functionality <strong>for</strong> LTE.<br />

Owing to the parallelism inherited in the QPP interleaver,<br />

the generation <strong>of</strong> parallel interleaver addresses can be<br />

Fig. 6.<br />

Fig. 7.<br />

BER<br />

10 −2<br />

10 −3<br />

10 −4<br />

10 −5<br />

10 −6<br />

uncoded BER<br />

coded BER <strong>Parallel</strong> <strong>Turbo</strong><br />

coded BER Serial <strong>Turbo</strong><br />

10 −7<br />

20 22 24 26 28 30 32<br />

10 −1 SNR [dB]<br />

Bit-Error-Rate (rate 0.926, 64-QAM)<br />

throughput [Mbit/s]<br />

40<br />

35<br />

30<br />

25<br />

20<br />

15<br />

10<br />

5<br />

coded throughput Serial <strong>Turbo</strong><br />

coded throughput <strong>Parallel</strong> <strong>Turbo</strong><br />

uncoded throughput<br />

0<br />

20 22 24 26 28 30 32<br />

SNR [dB]<br />

Link-level Throughput (rate 0.926, 64-QAM)<br />

III. SIMULATION AND SYSTEM PERFORMANCE<br />

In order to benchmark the parallel <strong>Turbo</strong> decoder,<br />

simulation is carried out based on a <strong>3GPP</strong> LTE link-level


simulator. The simulator is compatible to <strong>3GPP</strong> specification<br />

defined in 36.211 and 36.212 [2]. The <strong>3GPP</strong> SCME model<br />

[10] is used as the channel model. In the simulation, 5000<br />

subframes have been simulated. 64-QAM modulation and 5/6<br />

coding rate is used. S<strong>of</strong>t-output sphere decoder which<br />

achieves ML per<strong>for</strong>mance is used as the detection method. No<br />

close-loop precoding is assumed. At most three<br />

retransmissions are allowed in the CC based H-ARQ.<br />

Simulation results are depicted in Fig. 6 and Fig. 7. Note that<br />

in the simulation, only 5MHz band is used instead <strong>of</strong> 20MHz<br />

band <strong>for</strong> the maximum throughput. The link-level BER and<br />

throughput figures show that the per<strong>for</strong>mance degradation<br />

introduced by the approximations (e.g. the NII based<br />

approximation) adopted in the parallel decoder is very small.<br />

TABLE II<br />

COMPARISON OF ASIC IMPLEMENTATION RESULTS<br />

<strong>Turbo</strong> <strong>Decoder</strong> This Work [4]<br />

Technology CMOS 65nm CMOS 130nm<br />

Supply Voltage 1.2V 1.2V<br />

Logic Size 70kgate 44.1kgate<br />

Total Memory 250kb 120kb<br />

Total Area 0.7mm 2 1.2mm 2<br />

Working Frequency 250MHz 246MHz<br />

Throughput 152Mbps 10.8Mbps<br />

Estimated Power 650mW 32.4mW<br />

Silicon Area/Throughput 0.005 mm 2 /Mbps 0.11 mm 2 /Mbps<br />

Power Efficiency 0.71nJ/b/iter 0.54nJ/b/iter<br />

IV. ASIC IMPLEMENTATION<br />

The <strong>Turbo</strong> decoder is implemented using 65nm CMOS<br />

process from STMicroelectronics. After placement and<br />

routing, it easily runs at 250MHz with a core area <strong>of</strong> 0.7mm 2 .<br />

The throughput <strong>of</strong> the Radix-2 parallel decoder can be<br />

computed as:<br />

Sblk,max<br />

× fclk<br />

T =<br />

(3)<br />

S + S + C × × N<br />

( )<br />

pw,max sw extra 2 ite<br />

Sblk<br />

,max<br />

Nsiso<br />

blk<br />

where S pw,max<br />

= . Here S<br />

,max<br />

= 6144 is the maximum<br />

number <strong>of</strong> bits per codeword defined in [2], N siso = 8 is the<br />

number <strong>of</strong> SISO units, S sw = 64 is the sliding window size,<br />

C extra = 10 is the number <strong>of</strong> overhead cycles, fclk<br />

= 250MHz<br />

is<br />

the clock frequency and N ite is the number <strong>of</strong> decoding<br />

iterations. The throughput <strong>of</strong> the decoder is 152Mbit/s in case<br />

N ite = 6 . In high SNR region, early stopping can effectively<br />

reduce the number <strong>of</strong> iterations needed to four. This gives a<br />

throughput <strong>of</strong> 228Mbit/s. Theoretically, <strong>for</strong> 20MHz<br />

bandwidth with 2× 2 Spatial Multiplexing and 64-QAM, the<br />

required throughput <strong>of</strong> <strong>Turbo</strong> decoding is 176Mbit/s, which<br />

means the presented implementation is sufficient <strong>for</strong> CAT4<br />

PDSCH decoding. Owing to the scalability <strong>of</strong> the on-chip<br />

network in [3], the <strong>Turbo</strong> decoder can be easily wrapped and<br />

integrated with the programmable plat<strong>for</strong>m. TABLE II depicts<br />

the ASIC implementation result in comparison to the prior-art<br />

presented in [4]. The result shows that work in this paper<br />

achieves more than 10 times data throughput compared to [4]<br />

while achieving higher silicon efficiency and slightly lower<br />

power efficiency even without the voltage scaling used in [4].<br />

V. CONCLUSION<br />

In this paper, a 650mW <strong>Turbo</strong> decoder is presented <strong>for</strong><br />

CAT4 LTE terminals. The design is based on<br />

algorithm/architecture co-optimization targeting a good trade<strong>of</strong>f<br />

between per<strong>for</strong>mance and throughput. The result shows<br />

that at feasible silicon cost, the parallel architecture<br />

significantly reduces the decoding latency while allowing a<br />

link-level per<strong>for</strong>mance that is close to the traditional serial<br />

decoder to be achieved.<br />

Fig. 8.<br />

S8<br />

Interleaver<br />

S7<br />

S2<br />

S1<br />

Control<br />

Layout <strong>of</strong> the <strong>Turbo</strong> <strong>Decoder</strong><br />

S6<br />

S3<br />

VI. ACKNOWLEDGMENT<br />

The authors would like to thank Christian Mehlführer and<br />

the Christian Doppler Laboratory <strong>for</strong> Design Methodology <strong>of</strong><br />

Signal Processing Algorithms at Vienna University <strong>of</strong><br />

Technology, <strong>for</strong> contributions on the simulators.<br />

REFERENCES<br />

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit<br />

error-correcting coding and decoding: <strong>Turbo</strong> Codes,” in Proc. IEEE<br />

ICC, 1993, pp. 1064-1070.<br />

[2] <strong>3GPP</strong>, “Tech. Spec. 36.212 V8.4.0, E-UTRA; Multiplexing and Channel<br />

Coding,” Sept 2008.<br />

[3] Nilsson, E. Tell, and D. Liu, “An 11 mm 2 70 mW Fully-Programmable<br />

Baseband Processor <strong>for</strong> Mobile WiMAX and DVB-T/H in 0.12µm<br />

CMOS,” in IEEE Journal <strong>of</strong> Solid-State Circuits, vol. 44, no. 1, pp. 90-<br />

97, 2009.<br />

[4] Benkeser, A. Burg, T. Cupaiuolo, Q. Huang, “Design and Optimization<br />

<strong>of</strong> an HSDPA <strong>Turbo</strong> <strong>Decoder</strong> ASIC,” in IEEE Journal <strong>of</strong> Solid-State<br />

Circuits, vol. 44, no. 1, pp. 98-106, 2009.<br />

[5] S.Yoon, and Y. Bar-Ness, “A <strong>Parallel</strong> MAP Algorithm <strong>for</strong> Low Latency<br />

<strong>Turbo</strong> Decoding,” in IEEE Communication Letters, vol 6, pp. 288-290,<br />

July, 2002.<br />

[6] M.C. Shin and I.C. Park. “SIMD processor-based <strong>Turbo</strong> decoder<br />

supporting multiple third-generation wireless standards,” in IEEE Trans.<br />

on VLSI, vol.15, pp. 801-810, June 2007.<br />

[7] A.Dingninou, F. Raouafi, and C. Berrou, “Organisation de la memoire<br />

dans un turbo decodeur utilisant l'algorithm SUB-MAP,” in Proceedings<br />

<strong>of</strong> Gretsi, pp. 71-74, Sept. 1999.<br />

[8] R. Asghar, D. Liu, “Dual standard re-configurable hardware interleaver<br />

<strong>for</strong> turbo decoding,” in Proceedings <strong>of</strong> IEEE ISWPC, May 2008, pp. 768<br />

- 772.<br />

[9] R. Asghar, D. Wu, J. Eilert and D. Liu, “Memory Conflict Analysis and<br />

<strong>Implementation</strong> <strong>of</strong> a Re-configurable Interleaver Architecture<br />

Supporting Unified <strong>Parallel</strong> <strong>Turbo</strong> Decoding,” in Journal <strong>of</strong> Signal<br />

Processing Systems; DOI: 10.1007/s11265-009-0394-8. To Appear.<br />

[10] D. S. Baum, J. Salo, M. Milojevic, P. Kyöti, and J. Hansen “MATLAB<br />

implementation <strong>of</strong> the interim channel model <strong>for</strong> beyond-3G systems<br />

(SCME),” May 2005.<br />

S5<br />

S4

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!