01.02.2015 Views

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

Implementation of A High-Speed Parallel Turbo Decoder for 3GPP ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

jm f jm<br />

next iteration. Intermediate values γ , ( , )<br />

k β k + 1 which are<br />

computed half-way to get β m k will be passed directly to<br />

the LLR unit to compute the extrinsic in<strong>for</strong>mation Λ ( d k ).<br />

• LLR unit: It computes the extrinsic in<strong>for</strong>mation<br />

( m 1, m f (1, m ) 0, (0, )<br />

) ( m m f m )<br />

7 7<br />

Λ ( d ) = max α + γ + β − max α + γ + β<br />

Fig. 5.<br />

k k k k+ 1 k k k+<br />

1<br />

m= 0 m=<br />

0<br />

and the s<strong>of</strong>t-output based on the result from α unit and β<br />

unit. As the output <strong>of</strong> the SISO units, Λ ( d k ) is passed<br />

through the switch network to different memory banks <strong>for</strong><br />

later access by the SISO unit in the next iteration.<br />

0<br />

P1<br />

A1<br />

+<br />

–<br />

S1<br />

P2<br />

A2<br />

msb<br />

1<br />

0<br />

S2<br />

f2 >>1<br />

P3<br />

A3<br />

K<br />

AM<br />

+<br />

S3<br />

P4<br />

A4<br />

g(0)<br />

g_c<br />

1<br />

0<br />

R<br />

g_x_a<br />

S4<br />

P5<br />

<strong>Parallel</strong> Interleaver Architecture<br />

A5<br />

g_x_b<br />

g_x<br />

S5<br />

P6<br />

A6<br />

K<br />

AM<br />

+<br />

S6<br />

P7<br />

A7<br />

S<br />

R<br />

S7<br />

A<br />

P8<br />

A8<br />

achieved with the replication <strong>of</strong> the hardware <strong>for</strong> basic<br />

interleaver and getting support from a LUT providing the<br />

starting values. In this case a total <strong>of</strong> 32 additions are needed.<br />

However, <strong>for</strong> the sake <strong>of</strong> hardware reuse, part <strong>of</strong> the basic<br />

interleaver unit can be shared to generate multiple addresses at<br />

the same time. The optimized hardware uses 18 additions in<br />

total to generate 8 parallel interleaver addresses, thus saving<br />

14 additions. The hardware <strong>for</strong> parallel interleaver address<br />

generation <strong>for</strong> LTE is shown in Fig. 5.<br />

C. H-ARQ<br />

In LTE, two s<strong>of</strong>t-combining methods namely Chase<br />

Combining (CC) and Incremental Redundancy (IR) are<br />

supported. In this paper, only CC is considered. The base<br />

station retransmits the same packet when it receives the<br />

NACK signal from UE, and UE combines the LLR<br />

in<strong>for</strong>mation generated by the detector with those <strong>of</strong> the<br />

initially received packet. The gain <strong>of</strong> H-ARQ is at the cost <strong>of</strong><br />

larger s<strong>of</strong>t buffer used to store the demodulated s<strong>of</strong>t<br />

in<strong>for</strong>mation. In order to support the buffer size as large as<br />

1.8Mbits, high-density memory such as 1T-SRAM needs to be<br />

used. Since 1T-SRAM is different from the technology used<br />

<strong>for</strong> the decoder, it is not included in the results <strong>of</strong> this paper.<br />

B. <strong>Parallel</strong> Interleaver<br />

The internal interleaver involved with the <strong>Turbo</strong> Code in<br />

<strong>3GPP</strong>-LTE is based on quadratic permutation polynomial<br />

(QPP). Thanks to the QPP interleaver [2] which allows<br />

conflict-free parallel access, the implementation <strong>of</strong> the parallel<br />

interleaver is straight-<strong>for</strong>ward. In this paper, to support the<br />

parallel <strong>Turbo</strong> decoder, a configurable parallel interleaver is<br />

adopted. It can generate eight addresses in parallel <strong>for</strong> the data<br />

access <strong>of</strong> eight SISO units. QPP interleavers have very<br />

compact representation methodology and also inhibit a<br />

structure that allows the easy analysis <strong>for</strong> its properties. The<br />

interleaving function <strong>for</strong> turbo code is specified by the<br />

following quadratic permutation polynomial:<br />

2<br />

( )<br />

A( x) = f1ix+<br />

f2 i x % K<br />

(1)<br />

Here x = 0,1, 2, …( K − 1) , where K is the block size. This<br />

polynomial provides deterministic interleaver behavior <strong>for</strong><br />

different block sizes with appropriate values <strong>of</strong> f 1<br />

and f<br />

2<br />

.<br />

Direct implementation <strong>of</strong> the permutation polynomial given in<br />

eq. (1) is in-efficient due to multiplications, modulo function<br />

and bit growth problem. The simplified hardware solution is<br />

to use recursive approach adopted in [8] and [9]. Eq. (1) can<br />

be re-written <strong>for</strong> recursive computation as:<br />

( A )<br />

A( x+ 1) = ( x) + g( x) % K<br />

(2)<br />

Where g( x) = ( f1+ f2+ 2. f2. x)<br />

% K , which can also be<br />

computed recursively as g( x+ 1)<br />

= ( g( x) + 2. f2)<br />

% K . The two<br />

recursive terms I ( x+<br />

1)<br />

and g<br />

( x + 1)<br />

are easy to compute and they<br />

provide the basic interleaver functionality <strong>for</strong> LTE.<br />

Owing to the parallelism inherited in the QPP interleaver,<br />

the generation <strong>of</strong> parallel interleaver addresses can be<br />

Fig. 6.<br />

Fig. 7.<br />

BER<br />

10 −2<br />

10 −3<br />

10 −4<br />

10 −5<br />

10 −6<br />

uncoded BER<br />

coded BER <strong>Parallel</strong> <strong>Turbo</strong><br />

coded BER Serial <strong>Turbo</strong><br />

10 −7<br />

20 22 24 26 28 30 32<br />

10 −1 SNR [dB]<br />

Bit-Error-Rate (rate 0.926, 64-QAM)<br />

throughput [Mbit/s]<br />

40<br />

35<br />

30<br />

25<br />

20<br />

15<br />

10<br />

5<br />

coded throughput Serial <strong>Turbo</strong><br />

coded throughput <strong>Parallel</strong> <strong>Turbo</strong><br />

uncoded throughput<br />

0<br />

20 22 24 26 28 30 32<br />

SNR [dB]<br />

Link-level Throughput (rate 0.926, 64-QAM)<br />

III. SIMULATION AND SYSTEM PERFORMANCE<br />

In order to benchmark the parallel <strong>Turbo</strong> decoder,<br />

simulation is carried out based on a <strong>3GPP</strong> LTE link-level

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!