A High-Throughput Programmable Decoder for LDPC Convolutional Codes
Marcel Bimberg, Marcos B.S. Tavares, Emil Matúš and Gerhard P. Fettweis
Vodafone Chair Mobile Communications Systems
Technische Universität Dresden, D-01069 Dresden, Germany
Emails: {bimberg, tavares, matus, fettweis}@ifn.et.tu-dresden.de
Abstract
In this paper, we present and analyze a novel decoder architecture
for LDPC convolutional codes (LDPC-CCs). The
proposed architecture enables high throughput and can be
programmed to decode different codes and blocklengths,
which might be necessary to cope with the requirements of
future communication systems. To achieve high throughput,
the SIMD paradigm is applied to the regular graph
structure typical of LDPC-CCs. We also present the main
components of the proposed architecture and analyze its
programmability. Finally, synthesis results for a prototype
ASIC show that the architecture is capable of achieving decoding
throughputs of several hundred Mbit/s with attractive
complexity and power consumption.
1. Introduction
Low-density parity-check (LDPC) codes were discovered
by Gallager in 1963 [6] and, nowadays, they are among
the most promising error correcting schemes. The renewed
interest in Gallager's LDPC codes can be justified by their
simplicity and by their attractive performance/complexity
trade-off. Currently, LDPC codes are being considered by
the standardization committees of several future communication
systems as serious candidates for error control coding.
The convolutional counterparts of Gallager's LDPC
codes, the LDPC convolutional codes (LDPC-CCs), were
introduced in [1]. Compared with their block counterparts,
LDPC-CCs are not limited to a single blocklength. Instead,
the same encoder/decoder structure can be used to encode/decode
different codeword lengths, allowing easy adjustment
to changing environmental conditions. Therefore,
they are well suited for next-generation wireless
communication systems, which demand high flexibility. The
encoding of LDPC-CCs is performed in linear time using
shift-register operations, and their decoding is facilitated
by their highly structured underlying graphs.
As we will show in the next section, LDPC-CCs can be
decoded using low-complexity iterative algorithms, where
extrinsic information is exchanged between two decoding
steps. In a hardware implementation, this exchange of
messages is performed by interleavers [10]. When implementing
a decoder for LDPC block codes, such interleavers
rapidly become more and more complicated when issues
such as high throughput and large block sizes are considered.
Specifically, the implementation of the more promising irregular
block codes demands a combined code-architecture construction,
leading to a trade-off between hardware complexity
and error correction performance. Moreover, parallelization
concepts for decoding LDPC block codes are generally
limited to the sub-block size of the base matrix they are derived
from [7], [8].
The original construction of LDPC-CCs presented in [1]
is pseudo-random. In [11], Tanner et al. took advantage of
the relation between convolutional codes and quasi-cyclic
(QC) block codes to derive LDPC-CCs. The LDPC-CCs
obtained through this method are time-invariant. When implementing
time-invariant LDPC-CCs, the problems with
the exchange of messages can be easily overcome. For instance,
the graph regularity guarantees low-complexity interleavers,
very simple memory addressing and also homogeneity
in the parallel architecture.
The first architecture concepts for LDPC-CC decoders
were presented in [2] and [12], where an ASIC architecture
was designed for encoding/decoding one specific LDPC-CC.
The applied concepts were mainly derived from the
pipeline decoding algorithm proposed in [1]. In this paper,
we present a novel low-complexity, highly parallel decoder
architecture for time-invariant LDPC-CCs. Although only
the regular LDPC-CCs from [11] are considered throughout
this paper, our architecture is also capable of decoding irregular
LDPC-CCs. In this case, no changes in the hardware
are necessary: irregular codes can be accommodated
in our decoding architecture simply by writing the
corresponding software.
2. LDPC Convolutional Codes
As described in [6], LDPC codes are defined by sparse
parity-check matrices. In the case of LDPC-CCs, the parity-check
matrices, which are called syndrome former matrices,
exhibit a diagonal structure and are semi-infinite [1]. Thus,
the syndrome former H^T of an LDPC-CC can be written as

\[
H^T = \begin{pmatrix}
H_0^T & \cdots & H_{m_s}^T & & \\
 & \ddots & & \ddots & \\
 & & H_0^T & \cdots & H_{m_s}^T \\
 & & & \ddots & \ddots
\end{pmatrix}, \tag{1}
\]
where the scalar submatrices H_ν^T, ν = 0, 1, ..., m_s, have
dimensions c × (c − b) and thus determine the rate of the
code, which is given by R = b/c (i.e., b is the
number of information bits and c the number of coded bits).
As for LDPC block codes (LDPC-BCs), a code sequence v
belonging to an LDPC-CC satisfies the parity-check equation
vH^T = 0. Furthermore, if the number of ones in each
row of H^T is J and K is the number of ones in each column,
the LDPC-CC is called regular and is referred to as
an (m_s, J, K)-LDPC-CC (otherwise it is called an irregular
code). Obviously, J and K indicate the density of connections
of the graph nodes. The parameter m_s defines
the memory of the convolutional code and consequently the
critical distance of the graph. The critical distance of an
LDPC-CC is given by m_s + 1 and represents the minimum
temporal distance between nodes that are not connected to
each other.
2.1. Decoding Algorithm
The decoding of LDPC-CCs can be accomplished by
applying an iterative message passing algorithm to the
received code sequence. As shown in [4], the Min-Sum
algorithm is a good approximation for fixed-point implementations.
By utilizing log-likelihood ratios (LLRs),
this algorithm requires only low-complexity processing. The
messages m_ij that are passed along the edges connecting
variable and check nodes are calculated according to the
following decoding equations:
0. Initialization:
\[ m_{ij} = c_i = \mathrm{LLR}(ch_i) \tag{2} \]

1. Check node update:
\[ m_{ji} = \operatorname{sign}(m_{ij}) \prod_{i' \in V_j} \operatorname{sign}(m_{i'j}) \cdot \min_{i' \in V_j \setminus i} |m_{i'j}| \tag{3} \]

2. Variable node update:
\[ m_{ij} = \underbrace{c_i + \sum_{j' \in C_i} m_{j'i}}_{Q_i \,=\, \text{soft decision value}} - \; m_{ji} \tag{4} \]

3. Hard decision:
\[ \hat{v}_i = \begin{cases} 0 & \text{if } Q_i \geq 0 \\ 1 & \text{else} \end{cases} \tag{5} \]
Symbols:
ch_i  channel information belonging to variable node i
c_i   LLR of the channel information belonging to variable node i
m_ij  message passed from variable node i to check node j
m_ji  message passed from check node j to variable node i
V_j   set of all variable nodes connected to check node j
C_i   set of all check nodes connected to variable node i
v̂_i   estimated bit value for variable node i
After the initialization phase, the algorithm iterates
through steps 1–3 until either the parity-check equation
vH^T = 0 is fulfilled or a maximum number of iterations
has been reached. In the implementation presented
in Section 3, the decoder executes a predefined number of
decoding iterations.
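To make the update rules concrete, the following Python sketch runs equations (2)–(5) with a flooding schedule over a toy parity-check matrix; the dictionary-based graph representation and the example matrix are our own illustrative choices and do not reflect the hardware schedule described in Section 3.

```python
# Minimal Min-Sum sketch for equations (2)-(5); illustrative only.
# H is a small binary parity-check matrix given as a list of rows,
# llr is the list of channel LLRs c_i (positive values favor bit 0).

def min_sum_decode(H, llr, max_iter=10):
    n_chk, n_var = len(H), len(llr)
    V = [[i for i in range(n_var) if H[j][i]] for j in range(n_chk)]  # V_j
    C = [[j for j in range(n_chk) if H[j][i]] for i in range(n_var)]  # C_i
    # Initialization (2): variable-to-check messages m_ij = c_i
    m_vc = {(i, j): llr[i] for j in range(n_chk) for i in V[j]}
    m_cv = {(i, j): 0.0 for j in range(n_chk) for i in V[j]}
    for _ in range(max_iter):
        # Check node update (3): sign product and minimum over V_j \ i
        for j in range(n_chk):
            for i in V[j]:
                others = [m_vc[(i2, j)] for i2 in V[j] if i2 != i]
                sign = 1.0
                for m in others:
                    sign *= 1.0 if m >= 0 else -1.0
                m_cv[(i, j)] = sign * min(abs(m) for m in others)
        # Variable node update (4): m_ij = Q_i - m_ji, Q_i being the soft value
        Q = [llr[i] + sum(m_cv[(i, j)] for j in C[i]) for i in range(n_var)]
        for i in range(n_var):
            for j in C[i]:
                m_vc[(i, j)] = Q[i] - m_cv[(i, j)]
        # Hard decision (5) and parity check v H^T = 0
        v = [0 if Q[i] >= 0 else 1 for i in range(n_var)]
        if all(sum(v[i] for i in V[j]) % 2 == 0 for j in range(n_chk)):
            break
    return v

# Example: a single parity check over three bits, one bit received unreliably.
print(min_sum_decode([[1, 1, 1]], [2.0, -0.5, 1.5]))  # -> [0, 0, 0]
```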
2.2. Parallel Decoding Concept
LDPC-CCs can be described by a bipartite graph, as
shown in Fig. 1 for a (3, 2, 3)-LDPC-CC. As the graph
connections between variable and check nodes are the same
at each time instant for time-invariant codes, these codes are
well suited for a homogeneous, parallel VLSI implementation.
The parallelization method applied within our implementation
relies on the node-level parallelization concept,
which was investigated, among others, in [9]. Fig. 1(a) shows
the principle underlying this parallelization concept, which
is used as the basis for developing our highly parallel decoding
architecture.

[Figure 1. Principle of node-level parallelization of order p_t = 2: (a) check and variable nodes grouped into processing windows along the processing flow, with vector CN and VN operations; (b) message vectors A–E as stored in memory, with the vector operands for the CN and VN operations.]

Here, variable nodes are grouped into non-overlapping
segments called processing windows of length p_t. The message
vectors of length p_t are loaded sequentially
and fed to the vector computing elements responsible
for processing p_t check or variable node operations simultaneously.
Efficient implementation of the vector processing can
be achieved by using the SIMD computing model, which
exploits the independence and regularity of the graph connections.
Due to potential memory misalignments, the LDPC-CC
decoder demands the use of a shuffle network. In
Fig. 1(b), an example is given where the dashed rectangles
represent the message vectors as they are stored in memory.
According to this placement, the messages are already
aligned for variable node operations, e.g., vector messages
D and E. However, for check node operations, the memory
alignment is not always given, as can be seen from
vector messages B and C. In this case, a vector realignment
procedure needs to be applied between the variable and check
node computations.
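As a purely illustrative model of this realignment step (our own simplification, not the decoder's actual memory map), the following sketch treats a stored message vector as p_t variable-node-aligned values, rotates it into check-node order before the vector operation, and applies the inverse rotation before the write-back:

```python
# Simplified model of vector realignment between VN and CN operations.
# A message vector holds p_t values aligned to variable nodes; for a
# check-node operation the same values must appear in check-node order,
# which for the codes considered here amounts to a cyclic rotation.

def rotate(vec, s):
    """Cyclically rotate a length-p_t list by s positions."""
    s %= len(vec)
    return vec[s:] + vec[:s]

p_t = 8                                   # small toy parallelism
stored = list(range(p_t))                 # vector as it lies in memory (VN-aligned)
shift = 3                                 # shift value from the SHIFT RAM (assumed)

cn_operand = rotate(stored, shift)        # alignment for the check-node operation
updated = [x + 100 for x in cn_operand]   # stand-in for the vector CN update
write_back = rotate(updated, -shift)      # inverse rotation before the store

# The write-back vector is again variable-node aligned:
assert [x - 100 for x in write_back] == stored
print(cn_operand, write_back)
```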
3. Processor Architecture and Implementation
Details
Broadly speaking, our LDPC-CC decoder is based on the
synchronous transfer architecture (STA) presented in [5].
The STA provides an efficient platform for vectorized signal
processing algorithms in terms of low power consumption
and high performance. Therefore, it is a very good
choice for the implementation of our parallelized LDPC-CC
decoder. The block diagram in Fig. 2 shows the partitioning
of the previously described decoding algorithm into an
address generation part and a datapath part. While the address generation
utilizes 16-bit fixed-point arithmetic, the datapath
is designed for vector processing, where each vector
consists of p_t 8-bit data values.
3.1. Memory Organisation
3.1.1 Instruction Memory (IMEM)
In order to provide flexibility for decoding different LDPC-CCs,
specific program codes can be loaded into the instruction
memory (IMEM). For this purpose, a DMA interface
was implemented that is used to transfer data into the memories.
By using very long instruction words (VLIWs), all
functional units are able to work in parallel, thereby avoiding
stall cycles.

[Figure 2. Block diagram of the LDPC-CC decoder: the address generation part (AGU0, AGU1, OFFSET RAM, SHIFT RAM, FIFOs, CMP, and the control unit with IMEM and register file) and the vector datapath (DMEM with separate read and write ports, barrel shifters BARSHIFT and BARSHIFT⁻¹, vector FIFO, and a vector ALU with p_t processing nodes operating on p_t·N-bit vectors).]

Currently, the instruction words for our
implementation have widths of 127 bits (without any compression).
For prototyping purposes, we have chosen the
size of the instruction memory to be 1024 × 127 bits. For
specific implementations, IMEM can be downsized. As we
will show in Section 4, the total number of VLIWs required
to implement a regular (m_s, J, K)-LDPC-CC is given by

\[ N_{\mathrm{VLIW}} = 4JK + 3J + 6K + 32. \tag{6} \]
3.1.2 Data Memory (DMEM)
The data memory accommodates both the channel LLRs
and the messages that are exchanged between variable and
check nodes during the decoding iterations. An appropriate
addressing scheme that keeps the decoder flexible will be
described in more detail in Section 3.2. As depicted in Fig.
1(b), each vector edge corresponds to one memory location
that can be accessed by using a vector load/store instruction.
In our implementation, p_t = 64 values reside in
one vector. This results in a total vector width of 512
bits when N = 8 bits are used for the soft-value representation.
The vector edges are aligned according to the variable node
perspective. The associated channel values are stored in the
same manner. If we incorporate the additional 2⌈m_s/p_t⌉·p_t
overhead slots surrounding one coded sequence of length
L, the minimum memory size required for decoding amounts to

\[ C = \left( \frac{L}{K} + 2 \left\lceil \frac{m_s}{p_t} \right\rceil p_t \right) \cdot K \cdot (J+1) \cdot N \ \ \text{[bits]}, \tag{7} \]

where L/K is the number of time slots carrying coded bits.
Typical maximum codeword lengths that can be decoded
with a memory size of 64 KByte, as in our implementation,
range from 7594 bits for a (128, 5, 13)-LDPC-CC up to
15104 bits for a (127, 3, 5)-LDPC-CC. In order to keep the
decoding pipeline filled, a two-port RAM was implemented
so that new values can be loaded into the processing
nodes while computed results are concurrently written back into
memory.
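A small calculation following equation (7) reproduces the maximum blocklengths quoted above for the 64 KByte data memory; the helper functions below are our own and use the algebraically equivalent integer form of (7).

```python
from math import ceil

def dmem_bits(L, m_s, J, K, p_t=64, N=8):
    """Minimum DMEM size in bits, equation (7), written in the equivalent
    integer form (L + 2*ceil(m_s/p_t)*p_t*K) * (J+1) * N."""
    return (L + 2 * ceil(m_s / p_t) * p_t * K) * (J + 1) * N

def max_blocklength(C_bits, m_s, J, K, p_t=64, N=8):
    """Largest coded length L (in bits) that fits into C_bits of DMEM."""
    overhead_slots = 2 * ceil(m_s / p_t) * p_t       # extra time slots
    return C_bits // ((J + 1) * N) - overhead_slots * K

C = 64 * 1024 * 8                                    # 64 KByte DMEM in bits
for (m_s, J, K) in [(127, 3, 5), (128, 5, 13)]:
    L_max = max_blocklength(C, m_s, J, K)
    assert dmem_bits(L_max, m_s, J, K) <= C
    print(m_s, J, K, L_max)    # 15104 and 7594 bits, as quoted above
```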
3.2. Address Generation
As the interconnections between variable and check
nodes stay the same for each time instant, it is sufficient
to store K(J + 1) offset values for variable node computations
plus KJ offset values for check node computations
in one RAM. The functional units (FUs) involved in address
generation are shared by the two operation modes of the
processor, namely check node and variable node computations.
Accordingly, the minimum size of the OFFSET
RAM depends on the code with the largest product KJ.
In our implementation, 256 16-bit values can be stored in
the OFFSET RAM, enabling the decoding of codes with parameters
K ≤ 16 and J ≤ 7.
The data values in DMEM are addressed by adding an
offset value provided by the OFFSET RAM to a 16-bit base
address that points to the current p_t-wide processing
window. While AGU0 addresses the OFFSET RAM in
sequential order using modulo arithmetic, AGU1 is used for
the summation. The output address of AGU1 is directly used
for loading packed data from the vector memory. Since
every message update is stored at the same address from which
the original message was loaded, FIFO-AGU1 buffers
the load address until it is used for the store operation once
the update has finished. Concurrently, CMP uses the output
address to check whether it falls into the interval spanned by
the start and end addresses of the received block. If this condition
is violated, a proper setting of the byte-select signal
for the corresponding store operation prevents updates on values
surrounding the received block sequence (each received
block is enclosed by LLRs representing zero bits, based
on the assumption that encoding starts and ends in the
zero state).
While operating in check node mode, additional 7-bit
setup values for the rotations are provided to the two barrel
shifters by an extra SHIFT RAM (Fig. 2). Similarly to the
offset values, the shift values are loaded sequentially and
buffered by FIFOs serving the barrel shifters. The
chosen SHIFT RAM size of 128 words enables the processor
to decode various LDPC-CCs.
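The following sketch is a rough software model of this base-plus-offset addressing, including the CMP boundary check; the loop structure and the in_block flag standing in for the byte-select signal are our own simplifications, not the actual AGU micro-architecture.

```python
# Rough software model of the base-plus-offset addressing (our own
# simplification, not the RTL). AGU0 walks the OFFSET RAM sequentially
# with modulo arithmetic; AGU1 adds the base address of the active
# processing window; CMP flags addresses outside the received block so
# that the byte-select signal can suppress the corresponding store.

def generate_addresses(offsets, n_windows, vectors_per_window,
                       block_start=0, block_end=None):
    """Yield (address, in_block) pairs for all processing windows."""
    if block_end is None:
        block_end = n_windows * vectors_per_window - 1
    ptr = 0
    for w in range(n_windows):
        base = w * vectors_per_window                 # base of current window
        for _ in range(len(offsets)):
            offset = offsets[ptr % len(offsets)]      # AGU0: modulo traversal
            ptr += 1
            addr = base + offset                      # AGU1: base + offset
            yield addr, block_start <= addr <= block_end

# Toy example: K*(J+1) = 6 offsets for the variable node phase of a window.
offsets = [0, 1, 2, 3, 4, 5]
for addr, in_block in generate_addresses(offsets, n_windows=2,
                                         vectors_per_window=6):
    print(addr, in_block)
```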
3.3. Data Flow
After the first address has been calculated, the processor starts
to load p_t-wide packed data from the vector memory into the
barrel shifter. When operating in variable node mode, rotations
on the loaded data can be skipped since the data is already
aligned in memory for variable node operation. Therefore,
the data is simply fed forward into the p_t processing nodes
(Fig. 3a).

[Figure 3. Processing node: (a) variable node mode, with signed-magnitude/two's-complement conversion, accumulation (SUM) and per-edge subtraction via registers Reg1–Reg8 and a DEMUX; (b) check node mode.]

[Figure 4. Rearranging vector data words for check node operation: vector-memory load, shifting network, CN computation in the vector ALU, inverse shifting network, and vector-memory store.]

In check node mode, the first and second minima
of all incoming data belonging to the same check node are
evaluated (Fig. 3b). As the loaded data is in signed-magnitude
representation, the incoming bits [N−2:0] already
represent the magnitude value. Simultaneously, the leading
sign bits are evaluated by applying an XOR function. In the
last stage, all message updates are computed in one cycle
by comparing the internally stored input values to the first
and second minima (MIN1 and MIN2). Thus, the magnitude updates of
equation (3) can be written as

\[ |m_{ji}| = \begin{cases} \mathrm{MIN2} & \text{if } |m_{ij}| = \mathrm{MIN1} \\ \mathrm{MIN1} & \text{else.} \end{cases} \tag{8} \]
In order to guarantee the correct data alignment for the next
variable node operation, an inverse rotation is performed
on each data pair, residing in the FIFO and the processing
nodes, before it is written back into memory. A detailed
description of the implemented barrel shifters is given in
the next section.
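The following sketch illustrates, in software, why a single pass recording only MIN1, MIN2 and an XOR of the sign bits is sufficient to produce all outgoing messages of a check node according to equations (3) and (8); it is an illustrative model, not the vector ALU implementation.

```python
# Single-pass check-node update using only MIN1, MIN2 and a sign XOR,
# as in equation (8); software illustration of the hardware trick.

def check_node_update(inputs):
    """inputs: list of incoming m_ij; returns list of outgoing m_ji."""
    min1 = min2 = float("inf")
    min1_pos = -1
    sign_all = 0                                   # XOR of all sign bits
    for k, m in enumerate(inputs):                 # one pass over the inputs
        sign_all ^= 1 if m < 0 else 0
        mag = abs(m)
        if mag < min1:
            min1, min2, min1_pos = mag, min1, k
        elif mag < min2:
            min2 = mag
    outputs = []
    for k, m in enumerate(inputs):
        mag = min2 if k == min1_pos else min1      # equation (8)
        sign_out = sign_all ^ (1 if m < 0 else 0)  # remove own sign from XOR
        outputs.append(-mag if sign_out else mag)
    return outputs

print(check_node_update([2.0, -0.5, 1.5]))  # -> [-0.5, 1.5, -0.5]
```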
3.3.1 Barrel Shifter Implementation
In contrast to the solution shown in [9], the barrel shifter presented
here is better suited for area-constrained designs.
Instead of using a fully parallel structure for the shifting
network, our implementation applies a hierarchical design
consisting of three stages (Fig. 5). Shifting in each stage is
controlled by dedicated bits of the input shift value. If not
all p_t + 1 possible rotations are necessary, the area can be
reduced further by restricting the design to a particular set of codes.
The complete set of cyclic rotations for a certain code is
then given by the elements of the matrix

\[ S = \log_D\!\left[ H^T(D) \right] \bmod p_t, \tag{9} \]

where the log(·) and mod(·) operations are performed
element-wise on the polynomial syndrome former matrix
H^T(D). Place & Route (P&R) results show that the implemented
barrel shifter, featuring all possible rotations
for p_t = 64, is able to meet the target frequency of
f_clk = 200 MHz without any pipelining.
[Figure 5. Barrel shifter implementation: the 7-bit shift value controls three multiplexer stages, with bits [2:0] driving 128 (8:1) multiplexers of 8-bit input width, bits [5:3] driving 16 (8:1) multiplexers of 64-bit input width, and bit [6] driving 2 (2:1) multiplexers of 512-bit input width; the complete shift over the two packed data inputs is performed in 1 cycle.]
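To illustrate the hierarchical decomposition of Fig. 5, the following sketch (our own list-based simplification, with one list element standing in for one 8-bit value) performs the shift in three stages driven by the bit fields [2:0], [5:3] and [6] of the shift value and verifies the result against a direct selection:

```python
# Three-stage barrel shift over two concatenated p_t-element vectors,
# mirroring the multiplexer hierarchy of Fig. 5 (our own simplification).

P_T = 64

def rotate(data, s):
    """Cyclic left rotation of a list by s positions."""
    s %= len(data)
    return data[s:] + data[:s]

def staged_shift(vec_a, vec_b, shift):
    """Select p_t consecutive elements starting at 'shift' (0..p_t)
    from the concatenation vec_a + vec_b, in three stages."""
    assert 0 <= shift <= P_T
    data = vec_a + vec_b                             # 2*p_t elements (1024 bits)
    data = rotate(data, shift & 0x7)                 # bits [2:0]: shift by 0..7
    data = rotate(data, ((shift >> 3) & 0x7) * 8)    # bits [5:3]: 0, 8, ..., 56
    data = rotate(data, ((shift >> 6) & 0x1) * 64)   # bit  [6]:  0 or 64
    return data[:P_T]

# Check the decomposition against a direct selection for all shift values.
a = list(range(P_T))
b = list(range(P_T, 2 * P_T))
for s in range(P_T + 1):
    assert staged_shift(a, b, s) == (a + b)[s:s + P_T]
print("all", P_T + 1, "rotations verified")
```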
4. Programming Issues
The programming of our architecture is perhaps the task
that best reflects our design philosophy and, consequently,
also justifies our preference for LDPC-CCs.
Namely, because of the regularity and locality of the connections
of the graphs representing LDPC-CCs, it is possible
to obtain memories and data transfer architectures of low
complexity and without the need for conflict and/or stall
resolution. In this context, we have directed our efforts towards
a software architecture that takes advantage of these
properties offered by the LDPC-CCs.
The program code implementing an arbitrary regular
(m_s, J, K)-LDPC-CC consists of the different sections
listed in Table 1. In the first code section (INITIAL_PHASE),
the decoding is started. This section continues
until the first results of the check node computations
are to be stored into memory. The next code section
is CHECKOP_CORE, where the check node computations
are continued. In contrast to INITIAL_PHASE,
data vectors are simultaneously read from and written into
DMEM. The code section CHECKOP_CORE processes
all check node operations inside a processing window
of length p_t (see Fig. 1). This fact can also be observed
from the number of required VLIWs (2JK; the factor 2 is due
to the misalignment of the data vectors occurring in the check
node operations), which corresponds exactly to the number of
data vectors required for the check node operations inside a
processing window. Logically, CHECKOP_CORE is repeated until
all check nodes corresponding to the received codeword, which
are divided into processing windows, have been processed. The next
code section to be processed is CHECKOP_TAIL. In this
section, two major activities are performed: (1) it represents
the inverse of INITIAL_PHASE, i.e., the last results
of the check node operations are stored in DMEM
and the FIFOs are flushed; (2) the processing of the
bit nodes is started. The code sections BITOP_CORE and
BITOP_TAIL are the counterparts of CHECKOP_CORE
and CHECKOP_TAIL for the bit node processing, respectively.
Observe also that the number of VLIWs required
in BITOP_CORE ((J + 1)K + 1) corresponds to the number
of data vectors required for the bit node computations
plus one. The additional VLIW is required because
AGU1, which is also responsible for updating the
counter controlling the loop over BITOP_CORE, is always
busy with the address computations. Obviously, this extra
cycle could be avoided by inserting additional
hardware. Finally, the sections LAST_IT_BITOP_INITIAL,
LAST_IT_BITOP_CORE and LAST_IT_BITOP_TAIL are
equivalent to the second activity of CHECKOP_TAIL, to
BITOP_CORE and to BITOP_TAIL, respectively. The only
difference is that these sections are only processed in the
last iteration and that the positions of the channel values
c_i in DMEM are overwritten by the final values given by
the decoding operation, i.e., the Q_i's from (4).

Code Section            Number of VLIWs
INITIAL_PHASE           8 + 2K
CHECKOP_CORE            2JK
CHECKOP_TAIL            5 + 2K
BITOP_CORE              (J + 1)K + 1
BITOP_TAIL              5 + J
LAST_IT_BITOP_INITIAL   6 + J
LAST_IT_BITOP_CORE      (J + 1)K + 1
LAST_IT_BITOP_TAIL      6 + J
Total                   4JK + 3J + 6K + 32

Table 1. Sections of the program code for an arbitrary regular (m_s, J, K)-LDPC-CC.

As an illustration, Fig. 6 shows the flow graph of the program code
for an arbitrary LDPC-CC. The condition 'End of Graph?'
checks whether the check node or bit node processing has been
completed over all processing windows. The condition 'Last Iteration?'
checks whether the decoding is in its last iteration, so
that LAST_IT_BITOP_INITIAL, LAST_IT_BITOP_CORE
and LAST_IT_BITOP_TAIL are processed.
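As a quick consistency check (our own verification script, not part of the tool flow), summing the per-section VLIW counts of Table 1 reproduces the closed form of equation (6):

```python
# Sum of the per-section VLIW counts of Table 1 versus equation (6).
def n_vliw_sections(J, K):
    sections = {
        "INITIAL_PHASE":         8 + 2 * K,
        "CHECKOP_CORE":          2 * J * K,
        "CHECKOP_TAIL":          5 + 2 * K,
        "BITOP_CORE":            (J + 1) * K + 1,
        "BITOP_TAIL":            5 + J,
        "LAST_IT_BITOP_INITIAL": 6 + J,
        "LAST_IT_BITOP_CORE":    (J + 1) * K + 1,
        "LAST_IT_BITOP_TAIL":    6 + J,
    }
    return sum(sections.values())

def n_vliw_closed_form(J, K):
    return 4 * J * K + 3 * J + 6 * K + 32      # equation (6)

for J, K in [(3, 5), (5, 13)]:
    assert n_vliw_sections(J, K) == n_vliw_closed_form(J, K)
    print(J, K, n_vliw_closed_form(J, K))      # 131 and 385 VLIWs
```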
5. Decoder Throughput Analysis
The number of cycles necessary for one complete iteration
of our decoder can be derived from Table 1 and is given
by
\[
N_{\text{Cycles/Iter.}} \approx \underbrace{(8 + 2K)}_{\text{INITIAL\_PHASE}}
+ \underbrace{\left( N_W + \left\lceil \tfrac{m_s}{p_t} \right\rceil \right) \cdot (2JK)}_{\text{CHECKOP\_CORE}}
+ \underbrace{(5 + 2K)}_{\text{CHECKOP\_TAIL}}
+ \underbrace{N_W \left[ (J+1)K + 1 \right]}_{\text{BITOP\_CORE}}
+ \underbrace{(5 + J)}_{\text{BITOP\_TAIL}}, \tag{10}
\]
[Figure 6. Flow graph of the program code: START, INITIAL_PHASE, a CHECKOP_CORE loop ('End of Graph?'), CHECKOP_TAIL, then either a BITOP_CORE loop followed by BITOP_TAIL, or, in the last iteration ('Last Iteration?'), LAST_IT_BITOP_INITIAL, a LAST_IT_BITOP_CORE loop and LAST_IT_BITOP_TAIL before END.]
where N_W is the number of processing windows corresponding
to the length of the code sequence to be decoded.
Observe that the expression above only approximates the
real throughput of the architecture. The reason for this is
the jumps in the program flow that occasionally occur inside
CHECKOP_CORE and BITOP_CORE.
For a regular (m_s, J, K)-LDPC-CC, the number of coded
bits in a processing window is given by L_W = p_t · K. The
throughput is calculated as T = N_Infobits / N_Cycles/Iter. If we take
the expression in (10) and perform some manipulations, we
obtain the following equation for the throughput:

\[
T \approx \frac{N_W L_W R}{\underbrace{N_W \left[ 2JK + (J+1)K + 1 \right]}_{\mathrm{I}} + \underbrace{18 + 4K + J + (2JK) \left\lceil \tfrac{m_s}{p_t} \right\rceil}_{\mathrm{II}}}, \tag{11}
\]
where R = 1 − J/K is the rate of the code. Observe that
part I of the denominator, as well as the numerator, depends
on the number of processing windows N_W. Part II is a constant
determined by the code parameters J and K.
Consequently, if N_W is large enough, part II can be
neglected and the throughput can be approximated by

\[
T \approx \frac{p_t (K - J)}{3JK + K + 1} \ \ \left[ \frac{\text{info bits}}{\text{cycle} \times \text{iteration}} \right]. \tag{12}
\]
Interestingly, the expression in (12) only depends on the
code parameters (J, K) and on the parallelism p_t of the architecture.
Fig. 7 shows the throughputs achieved by our
concept for different codes and parallelisms p_t. The clock
frequency is set to f_clk = 200 MHz. Considering our implementation
(p_t = 64), two points for different codes were measured.

[Figure 7. Throughput for one iteration depending on the parallelism p_t, for various (J, K) pairs and the two measured points. The clock frequency is f_clk = 200 MHz.]

For the code with parameters (m_s = 127, J = 3, K = 5), the blocklength
was L = 3200, i.e., N_W = 10. For the code with parameters
(m_s = 128, J = 5, K = 13), the blocklength was L = 5824, i.e.,
N_W = 7. As we can observe, the measurements almost coincide with the
curves obtained from (12) already at these relatively short
blocklengths.
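The throughput figures can be reproduced with a few lines of arithmetic; the script below (our own, using equation (10) as reconstructed above, including the ceiling in the m_s/p_t term) evaluates the exact per-iteration cycle count and the asymptotic approximation of equation (12) for the two measured configurations.

```python
from math import ceil

F_CLK = 200e6      # clock frequency in Hz
P_T = 64           # parallelism of the implementation

def cycles_per_iter(m_s, J, K, N_W, p_t=P_T):
    """Cycle count per iteration according to equation (10)."""
    return ((8 + 2 * K)
            + (N_W + ceil(m_s / p_t)) * (2 * J * K)
            + (5 + 2 * K)
            + N_W * ((J + 1) * K + 1)
            + (5 + J))

def throughput_exact(m_s, J, K, N_W, p_t=P_T):
    """Information throughput per iteration, equation (11), in bit/s."""
    R = 1 - J / K
    infobits = N_W * p_t * K * R
    return infobits / cycles_per_iter(m_s, J, K, N_W, p_t) * F_CLK

def throughput_asymptotic(J, K, p_t=P_T):
    """Large-N_W approximation, equation (12), in bit/s."""
    return p_t * (K - J) / (3 * J * K + K + 1) * F_CLK

# The two measured configurations from Fig. 7; the exact value approaches
# the asymptote as the number of processing windows N_W grows.
print(throughput_exact(127, 3, 5, N_W=10), throughput_asymptotic(3, 5))
print(throughput_exact(128, 5, 13, N_W=7), throughput_asymptotic(5, 13))
```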
6. Simulation Results
Fig. 8 shows the bit error rates for the regular (127, 3, 5)-
and (128, 5, 13)-LDPC-CCs with blocklengths L = 3200
and L = 5824, respectively. The curves were obtained by
decoding these codes using 10, 30 and 50 iterations of the
4-bit and 8-bit quantized Min-Sum algorithm. As can be
observed, there is a considerable coding gain between 10
and 30 decoding iterations. On the other hand, the gain
between 30 and 50 iterations is almost negligible for both
codes. The effects of the different quantizations can also be observed
in Fig. 8. At least for the two codes investigated here,
there is no significant gain from using 8-bit instead
of 4-bit quantization. Despite this, we implemented
our design using 8-bit quantization because for other codes
(e.g., irregular codes) it might result in faster convergence
(i.e., fewer iterations for a certain target BER) and also in lower
error floors.
[Figure 8. BER versus E_b/N_0 in dB for the (127, 3, 5)- and (128, 5, 13)-LDPC-CCs decoded with the 4-bit and 8-bit quantized Min-Sum algorithm and 10, 30 and 50 iterations.]
7. Tool Flow and Implementation Results
Based on the integrated design flow shown in [5], the
architecture was first described as an XML model, which
served as input for the automated generation of the simulation
model, the assembler and the HDL description. The data input
values for each of the memories are also generated automatically
by a dedicated MATLAB environment. Synthesis for a
UMC 130 nm, 8-metal-layer, 1.2 V CMOS technology was
accomplished with the Synopsys Design Compiler. In order
to reduce power consumption, operand isolation and clock
gating were deployed. For efficient memory compilation,
Faraday's memaker tool was used. Due to the large bit
widths of the instruction and data memories, both were partitioned
into several banks. As depicted in the final P&R
layout (Fig. 9), the instruction memory IMEM consists of
two banks and the data memory DMEM of eight banks. The
system clock frequency f_clk = 200 MHz was met with a chip
area of 7.83 mm² and a utilization of 69.5%. Table 2 shows
the area contribution of each unit in more detail. As we can
observe, the vector FIFO and the vector ALU contribute
almost 80% of the area of the computational core. The total
gate count of the decoder amounts to 1.2 MGates.
After P&R, the power consumption was estimated using
PrimePower. For this purpose, we simulated the decoding
of the (128, 5, 13)-LDPC-CC with a blocklength of L = 5824
bits. At 2.8 dB and a clock frequency of f_clk = 200 MHz,
the average power consumption for decoding with 10 iterations
is 437 mW. This results in an energy consumption
of 660 pJ per decoded bit at a bit error rate of BER ≈ 10⁻³.
Table 3 shows the breakdown of the power consumption
for each unit of the LDPC-CC decoder when simulating
the (128, 5, 13)-LDPC-CC. Clearly, the vector ALU contributes
most to the power consumption of the decoder,
with 56%.
Computational Core      Gates     Relative
AG                      18949     4%
BARSFT                  26508     6%
BARSFT⁻¹                42414     10%
VALU                    263248    60%
VFIFO                   82851     19%
CTRL                    6093      1%
Total                   440063    100%

Memories
IMEM                    89670     12%
OFFSET & SHIFT RAM      21086     3%
DMEM                    651035    85%
Total                   774290    100%

LDPC-CC decoder
Computational Core      440 K     37%
Memory                  774 K     63%
Total                   1214 K    100%

Table 2. Gate count for the LDPC-CC decoder.
Computational Core      Power (mW)   Relative
AG                      13           4%
BARSFT                  30           9%
BARSFT⁻¹                37           11%
VALU                    188          56%
VFIFO                   37           11%
CTRL                    30           9%
Total                   335          100%

Memories
IMEM                    14           14%
OFFSET & SHIFT RAM      4            4%
DMEM                    84           82%
Total                   102          100%

LDPC-CC decoder
Computational Core      335          77%
Memory                  102          23%
Total                   437          100%

Table 3. Power consumption for the LDPC-CC decoder.
8. Conclusions
In this paper, a novel programmable decoder architecture
for time-invariant LDPC-CCs was presented. The architecture
is suitable for decoding various time-invariant LDPC-CCs
and runs at a moderate clock frequency of 200 MHz.
Because of the regularity of time-invariant LDPC-CCs, the
architecture is highly parallel and able to achieve throughputs
of several hundred Mbit/s. At 2.8 dB, the estimated average
power consumption with a supply voltage of 1.2 V for
the (128, 5, 13)-LDPC-CC is 437 mW. Besides further power
measurements on the chip, which is currently being prepared for
tape-out, our ongoing research will investigate an enhanced
memory management methodology enabling power reduction
and higher throughput within a multi-core architecture.
[Figure 9. LDPC-CC decoder layout, 2.7 mm × 2.9 mm: eight DMEM banks, two IMEM banks, OFFSET RAM and SHIFT RAM.]
9. Acknowledgments
This work was supported by the German Federal Ministry of
Education and Research (BMBF) within the Wireless Gigabit
with Advanced Multimedia Support (WIGWAM) project
under grant 01 BU 370. The authors would like to thank Georg
Ellguth for his assistance in P&R.
References
[1] A. Jiménez Feltström and K. Sh. Zigangirov. Periodic time-varying convolutional codes with low-density parity-check matrices. IEEE Trans. Inform. Theory, 45(5):2181–2190, Sep. 1999.
[2] S. Bates and G. Block. A memory-based architecture for FPGA implementations of low-density parity-check convolutional codes. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 2005.
[3] S. Bates, L. Gunthorpe, A. Pusane, Z. Chen, K. Sh. Zigangirov, and D. J. Costello, Jr. Decoders for low-density parity-check convolutional codes with large memory. In Proc. NASA VLSI Symposium, 2005.
[4] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Hu. Reduced-complexity decoding of LDPC codes. IEEE Trans. Commun., 53(8), Aug. 2005.
[5] G. Cichon, P. Robelly, H. Seidel, E. Matúš, M. Bronzel, and G. Fettweis. Synchronous transfer architecture (STA). In Proc. SAMOS, pages 126–130, June 2004.
[6] R. Gallager. Low-Density Parity-Check Codes. MIT Press, Cambridge, MA, 1963.
[7] M. Karkooti, P. Radosavljevic, and J. R. Cavallaro. Configurable, high-throughput, irregular LDPC decoder architecture: Tradeoff analysis and implementation. In Proc. IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Colorado, USA, Sept. 2006.
[8] M. Mansour and N. Shanbhag. A 640-Mb/s 2048-bit programmable LDPC decoder chip. IEEE J. Solid-State Circuits, vol. 41, March 2006.
[9] E. Matúš, M. B. S. Tavares, M. Bimberg, and G. Fettweis. Towards a GBit/s programmable decoder for LDPC convolutional codes. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, May 2007.
[10] T. Richardson and V. Novichkov. Methods and apparatus for decoding LDPC codes. U.S. Patent No. 7,133,853, 2006.
[11] R. M. Tanner, D. Sridhara, A. Sridharan, T. E. Fuja, and D. J. Costello, Jr. LDPC block and convolutional codes based on circulant matrices. IEEE Trans. Inform. Theory, 50(12):2966–2984, Dec. 2004.
[12] R. Swamy, S. Bates, and T. Brandon. Architectures for ASIC implementations of low-density parity-check convolutional encoders and decoders. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 2005.