BLIND TURBO DECODING OF SIDE-INFORMED DATA HIDING USING ITERATIVE CHANNEL ESTIMATION

Félix Balado
University College Dublin
Belfield, Dublin 4, Ireland

Fernando Pérez-González
University of Vigo
Campus Universitario, 36200 Vigo, Spain

ABSTRACT

Distortion-Compensated Dither Modulation (DC-DM) has been theoretically shown to be a near-capacity achieving data hiding method, thanks to its use of side information at the encoder. In practice, channel coding is needed to approach its achievable rate limit. However, the most powerful coding methods, such as turbo coding, require knowledge of the channel model. We investigate here the possibility of undertaking blind iterative decoding of DC-DM. To this end, we undertake maximum likelihood estimation of the channel model, intertwining the Expectation-Maximization algorithm within the decoding procedure.

1. INTRODUCTION

The use of side information at the encoder has proven crucial to the data hiding problem. The solution provided by Costa [1] for a similar communications setting has been decisive in showing that host signal-induced self-distortion can be effectively removed through a clever design of the transmission codebook. In the context of data hiding this result was first pointed out by Chen and Wornell, who showed [2] that their DC-DM scheme was, asymptotically, only 1.53 dB away from Costa’s capacity.

Channel coding is the way to approach channel capacity in any communications scenario and, therefore, also in data hiding using side information at the encoder. A number of prior works have studied the use of state-of-the-art channel coding for side-informed data hiding [3, 4, 5], using near-capacity achieving turbo codes over scalar side-informed methods following Costa’s guidelines. The schemes used therein involve scalar uniform quantizers which are resized using a scaling factor before quantization (i.e., amounting to distortion compensation), and hence they are equivalent to DC-DM.
The type of channel and the level of distortion must be known in order to undertake iterative decoding, and all these works assumed that this information was available to the decoder. Here we explore how to perform blind¹ iterative decoding of DC-DM, i.e., without the aforementioned assumptions.

Enterprise Ireland is kindly acknowledged for supporting this work under the Informatics Research Initiative. This work has also been funded by the Xunta de Galicia under grants PGIDT01 PX132204PM and PGIDT02PXIC32205PN; CYCIT, AMULET project, reference TIC2001-3697-C03-01; and FIS, IM3 Research Network, reference FIS-G03/185.

¹ This term refers to the ignorance of the channel model by the decoder; not to be confused with blind vs. non-blind data hiding.

1.1. Framework

We assume that we pseudorandomly choose $N$ samples $\mathbf{x} = (x[1], \ldots, x[N])$ from a host signal; the samples in $\mathbf{x}$ are independent identically distributed (i.i.d.) zero-mean random variables with covariance matrix $\Gamma_x = \sigma_x^2 \cdot I$. The corresponding watermarked signal $\mathbf{y}$ undergoes a zero-mean random additive attack channel, so that the signal received at the decoder is $\mathbf{z} = \mathbf{y} + \mathbf{n}$. The samples of the random variable $\mathbf{n}$ are assumed to be i.i.d. and independent of $\mathbf{x}$, with unknown probability density function (pdf) and variance $\sigma_n^2$.

In binary DC-DM [2] one information symbol $b[k] \in \{\pm 1\}$ is hidden by quantizing a sample of the host signal $x[k]$ to the nearest centroid $Q_{b[k]}(x[k]) \in \Lambda_{b[k]}$ belonging to the uniform lattice² $\Lambda_{b[k]}$ given by

$$\Lambda_{b[k]} = 2\Delta\mathbb{Z} + \Delta\,\frac{b[k]+1}{2} + d[k],$$

with $d[k]$ a key-dependent value that can be taken as zero for the analysis. The watermarked signal is obtained as

$$y[k] = x[k] + \nu \cdot e[k] = Q_{b[k]}(x[k]) - (1-\nu)\cdot e[k], \qquad (1)$$

i.e., the watermark is the quantization error $e[k] \triangleq Q_{b[k]}(x[k]) - x[k]$ weighted by an optimizable constant $\nu$, $0 \le \nu \le 1$. The relation $\Delta \ll \sigma_x$ usually holds true due to perceptual reasons.
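The embedding rule (1) can be illustrated with a minimal Python sketch. The function names `nearest_centroid` and `dcdm_embed` are hypothetical (they do not come from the original work), the dither $d[k]$ defaults to zero as in the analysis, and the numerical values are illustrative only.

```python
def nearest_centroid(x, b, delta, d=0.0):
    """Return Q_b(x): the point of 2*delta*Z + delta*(b + 1)/2 + d closest to x."""
    shift = delta * (b + 1) / 2 + d
    return 2 * delta * round((x - shift) / (2 * delta)) + shift


def dcdm_embed(x, b, delta, nu, d=0.0):
    """Embed one symbol b in {-1, +1} into the host sample x following (1)."""
    e = nearest_centroid(x, b, delta, d) - x  # quantization error e[k]
    return x + nu * e                          # watermarked sample y[k]


# Illustrative values only (not taken from the experiments).
delta, nu, x = 1.0, 0.65, 3.3
for b in (-1, +1):
    q = nearest_centroid(x, b, delta)
    y = dcdm_embed(x, b, delta, nu)
    # The two forms of (1) should agree up to rounding error.
    assert abs(y - (q - (1 - nu) * (q - x))) < 1e-12
```

With $\nu = 1$ the sketch reduces to plain dither modulation (the sample is moved onto the centroid), whereas $\nu < 1$ leaves a residual self-noise of amplitude $(1-\nu)|e[k]|$ around the centroid.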
When $\Delta \ll \sigma_x$, and for a wide range of hosts, $e[k]$ can be assumed to be independent of $x[k]$ and uniformly distributed, $e[k] \sim U(-\Delta, \Delta)$. Then, the watermark $w[k] = y[k] - x[k]$ is also uniform, and the embedding power is $E\{w^2[k]\} = \nu^2\Delta^2/3$. The decoder acts by quantizing the received signal $\mathbf{z}$, sample by sample, to the closest codebook lattice. Hence we have that

$$\hat{b}[k] = \arg\min_{b \in \{\pm 1\}} \left| Q_b(z[k]) - z[k] \right|. \qquad (2)$$

² Extending the usual definition of lattice, which in principle must include the origin.

Following what we stated in the introduction, we will hide a binary codeword $\mathbf{c} = (c[1], \ldots, c[N])$ instead of hiding $N$ uncoded bits using the previous scheme. The codeword is obtained by encoding a binary information vector $\mathbf{b} = (b[1], \ldots, b[M])$, $M < N$, using a rate $R = M/N$ code. For embedding and decoding we will consider that the codeword symbols are given in antipodal form, i.e., $c[k] \in \{\pm 1\}$. We will center our attention on parallel concatenated codes with iterative decoding, i.e., turbo codes. We recall that the parallel concatenated turbo codewords have the form

$$\mathbf{c} = (\mathbf{c}_s \mid \mathbf{c}_{p_1} \mid \mathbf{c}_{p_2}), \qquad (3)$$

where the subvector $\mathbf{c}_s = \mathbf{b}$ is the systematic output, and the subvectors $\mathbf{c}_{p_1}$ and $\mathbf{c}_{p_2}$ are the parity outputs corresponding to the constituent recursive systematic convolutional (RSC) codes.

Fig. 1. One step of the iterative EM algorithm intertwined with iterative turbo decoding. Necessary interleavings/deinterleavings of $\mathbf{z}_s$ and $L(\mathbf{b})$ for BCJR are not explicitly shown for simplicity. (Block diagram: channel model $h(\boldsymbol{\theta}^*, \cdot)$, BCJR (Expectation), Maximization; signals $\mathbf{z}$, $L(\mathbf{b})$, $L(\mathbf{b})'$, $\boldsymbol{\theta}^*$.)

The choice of $\nu$ is important because there is a different optimum at each watermark-to-noise ratio (WNR) for the achievable rate of DC-DM [4]. The $\mathrm{WNR} = 10\log_{10}\left(\nu^2\Delta^2/(3\sigma_n^2)\right)$ is not known beforehand by the encoder, who cannot know $\sigma_n$. Previous works [3, 4, 5] nevertheless assumed that it was known, and so they used the optimal scaling of their lattices (i.e., the optimal distortion compensation factor $\nu$) at each WNR. Here we will use a fixed $\nu$ regardless of the WNR, which is more realistic. Turbo codes present a distinctive waterfall of the decoded bit error rate at a WNR value relatively close to the minimum for asymptotically errorless decoding. Then, we can approximately choose the optimal $\nu$ as the one that corresponds to the WNR at the achievable rate $R$ imposed by the turbo code. As the code cannot be perfect, the optimum will actually correspond to a slightly higher WNR. Notice however that this choice requires knowledge of the channel model for computing the achievable rate vs. WNR plots [4], but it is all that can be done. In addition, this optimization does not hold for WNRs more negative than the waterfall area, but this is unimportant due to the high probabilities of error associated with turbo decoding in this range.

2. EXACT ITERATIVE DECODING OF DC-DM

First, we will explain how to exactly establish the reliability of the channel decisions when the channel model is known by the decoder to be Gaussian with variance $\sigma_n^2$.
The decoder receives the noisy signal $\mathbf{z}$ and proceeds to perform MAP iterative decoding. This requires the probabilities $p(z[k] \mid c[k] = c)$, $c \in \{\pm 1\}$, for computing the reliability log-likelihood ratios. Considering (1), we have that $y[k]$ is uniform around its centroid, and then that the pdf of $z[k] = y[k] + n[k]$ is the convolution of a uniform pdf and a Gaussian pdf. We can put this pdf as $f(z[k]) * \delta\{z[k] - Q_c(x[k])\}$, with

$$f(z) \triangleq \frac{1}{2(1-\nu)\Delta}\left\{ Q\!\left(\frac{z - (1-\nu)\Delta}{\sigma_n}\right) - Q\!\left(\frac{z + (1-\nu)\Delta}{\sigma_n}\right) \right\}, \qquad (4)$$

and $Q(z) \triangleq \int_z^{\infty} \exp(-x^2/2)/\sqrt{2\pi}\, dx$. This pdf of $z[k]$ is conditioned on a concrete centroid assumption, but we need the pdf for a generic symbol decision. To obtain this expression notice that, due to using (2) at the decoder, the decision $\hat{c}[k]$ can be seen as being based on the modular offsets

$$\tilde{z}_c[k] \triangleq \{z[k] \bmod \Lambda_c\} - \Delta = \left\{\left(z[k] + \frac{c+1}{2}\,\Delta\right) \bmod 2\Delta\right\} - \Delta \qquad (5)$$

to each one of the two lattices $\Lambda_c$, with $c \in \{-1, 1\}$. Using these offsets, the minimum distance decision can be rewritten as

$$\hat{c}[k] = \arg\min_c \left| \tilde{z}_c[k] \right|. \qquad (6)$$

Considering (6), it is clear that the reliability measure for the decision $\hat{c}[k] = c$ is just

$$p(z[k] \mid c[k] = c) \triangleq \tilde{f}(\tilde{z}_c[k]),$$

with $\tilde{f}(\cdot)$ the pdf followed by $\tilde{z}_c[k]$. Notice that the operation (5) implies that this pdf is just the aliasing of the sections of (4) corresponding to the Voronoi regions of the lattice $2\Delta\mathbb{Z}$, that is

$$\tilde{f}(z) = \begin{cases} \sum_{w \in 2\Delta\mathbb{Z}} f(z - w), & |z| \le \Delta \\ 0, & |z| > \Delta. \end{cases} \qquad (7)$$

3. BLIND ITERATIVE DECODING OF DC-DM

We assume next that the decoder does not know (7). In the communications field we can find some approaches that blindly estimate the pdf of an unknown additive channel, such as the one by Li et al. [6], who propose to heuristically refine a kernel-based model at each iterative decoding step using the increasingly accurate intermediate decoded information. We will follow a similar approach, but on sounder theoretical grounds.
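As a point of reference for what the blind decoder has to approximate, the Gaussian-matched pdfs (4) and (7) can be evaluated numerically. The following Python sketch is only illustrative: the function names are hypothetical, the infinite sum in (7) is truncated to a finite number of shifts (`n_terms`, an assumption of this sketch), and $\nu < 1$ is assumed so that the uniform self-noise has non-zero width.

```python
import math


def q_func(t):
    """Gaussian tail probability Q(t), written in terms of erfc."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def f_unfolded(z, delta, nu, sigma_n):
    """Pdf (4): uniform self-noise of half-width (1 - nu)*delta convolved
    with zero-mean Gaussian noise of standard deviation sigma_n."""
    a = (1.0 - nu) * delta
    return (q_func((z - a) / sigma_n) - q_func((z + a) / sigma_n)) / (2.0 * a)


def f_aliased(z, delta, nu, sigma_n, n_terms=50):
    """Pdf (7): (4) folded onto [-delta, delta] by summing its shifts by
    multiples of 2*delta (sum truncated to |m| <= n_terms)."""
    if abs(z) > delta:
        return 0.0
    return sum(f_unfolded(z - 2.0 * delta * m, delta, nu, sigma_n)
               for m in range(-n_terms, n_terms + 1))
```

A quick sanity check under these assumptions is that `f_aliased` integrates to approximately one over $[-\Delta, \Delta]$, since folding preserves the total probability mass of (4).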
Taking advantage of the fact that the support set of $\tilde{f}(z)$ is limited to $|z| \le \Delta$, we can resort to approximating (7) using a simple but general model based on a finite number $N_q$ of rectangular kernels. This model depends on the parameter vector $\boldsymbol{\theta} = (\theta[1], \ldots, \theta[N_q])$ and it is given by

$$h(\boldsymbol{\theta}, z) \triangleq \sum_{i=1}^{N_q} \theta[i] \cdot \Pi\left(z - (i-1)\cdot\Delta_q + \Delta\right). \qquad (8)$$

In the expression above the kernels $\Pi(z)$ are defined as

$$\Pi(z) \triangleq \begin{cases} 1/\Delta_q, & 0 < z \le \Delta_q \\ 0, & \text{otherwise,} \end{cases} \qquad (9)$$

with $\Delta_q \triangleq 2\Delta/N_q$, which we assume integer. Of course, $h(\boldsymbol{\theta}, z) = 0$ for $|z| > \Delta$. Notice that a further advantage of (8) is that it makes no assumptions on the symmetry of the attack pdf. This model is usually considered to be nonparametric, although we can see it as a parametric one in which $\boldsymbol{\theta}$ has to be adjusted.

Our objective is therefore to optimally estimate $\boldsymbol{\theta}$ from the received vector $\mathbf{z}$. The maximum likelihood approach for this estimation can be stated as

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} P(\mathbf{z}, \boldsymbol{\theta}). \qquad (10)$$
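The piecewise-constant model (8)-(9) amounts to a lookup in a table of $N_q$ bins. A minimal Python sketch follows; the names `bin_index` and `h` and the 0-based indexing are choices of this sketch rather than of the original work, and bin edges are treated exactly as in (9) (open on the left, closed on the right).

```python
import math


def bin_index(z, delta, n_q):
    """0-based index of the kernel of (8) whose support contains z,
    or None when z lies outside (-delta, delta]."""
    if not (-delta < z <= delta):
        return None
    dq = 2.0 * delta / n_q                  # kernel width Delta_q
    i = math.ceil((z + delta) / dq) - 1     # right-closed intervals, as in (9)
    return min(max(i, 0), n_q - 1)          # guard against rounding at the edges


def h(theta, z, delta):
    """Evaluate the rectangular-kernel pdf model (8) at z."""
    n_q = len(theta)
    i = bin_index(z, delta, n_q)
    if i is None:
        return 0.0
    return theta[i] / (2.0 * delta / n_q)   # theta[i] times the kernel height 1/Delta_q
```

For instance, with all weights equal to $1/N_q$ the model reduces to the uniform pdf $1/(2\Delta)$ on its support, and in general it integrates to $\sum_i \theta[i]$, which is why the weights must add up to one.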

This estimation problem is inherently involved. Still, we may notice that the elements of $\mathbf{z}$ stem from the mixture of data drawn from two different distributions. At each $z[k]$ these two possible distributions (which are in fact the same one shifted by the offset $\Delta$) correspond to each of the two possible embedded symbols $c[k] \in \{\pm 1\}$. This is the situation for which the Expectation-Maximization (EM) algorithm [7] was conceived, aiming at finding the solution of (10) iteratively with theoretically proven convergence properties. Unfortunately, we cannot afford the hypothesis of independence between the elements of $\mathbf{z}$ that correspond to the codeword parities, which complicates the solution to (10). For this reason we will resort to solving instead

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} P(\mathbf{z}_s, \boldsymbol{\theta}),$$

with $\mathbf{z}_s$ the subvector of $\mathbf{z}$ corresponding to the systematic part $\mathbf{c}_s = \mathbf{b}$ of the codeword $\mathbf{c}$, following the notation in (3). As we will see next, the turbo code can nevertheless be used to improve the EM algorithm beyond what we could get with $\mathbf{z}_s$ alone. In this way, we can intertwine the iterative turbo decoding with the iterative estimation problem. We describe next the two steps of the EM algorithm and their application to our problem, which is summarized in Figure 1.

1. Expectation Step. This step is equivalent to computing a probability mass function (pmf) of $\mathbf{c}_s = \mathbf{b}$ (hidden data) under the knowledge of $\mathbf{z}_s$ and $\boldsymbol{\theta}$, that is

$$q(\mathbf{b}) \triangleq P(\mathbf{b} \mid \mathbf{z}_s, \boldsymbol{\theta}). \qquad (11)$$

Actually, each iterative turbo decoding stage optimally updates the previous extrinsic pmf of $\mathbf{b}$ using the BCJR algorithm, which takes into account $\mathbf{z}$ (and not only $\mathbf{z}_s$), the code used for the current parity, and the channel model given by $\boldsymbol{\theta}$. Therefore, the probabilities $q(b[k])$, for $k = 1, \ldots, M$, given by the BCJR algorithm, are the best way to compute (11). Assuming that the information bits $b[k]$ are independent, we can write

$$q(\mathbf{b}) = \prod_{k=1}^{M} q(b[k]). \qquad (12)$$

Recall that we can straightforwardly compute these probabilities from the log-likelihood ratios $L(b[k]) = \log\{q(b[k] = +1)/q(b[k] = -1)\}$.

2. Maximization Step. Now, using the pmf (12) and $\mathbf{z}_s$ we need to compute the new $\boldsymbol{\theta}$ that maximizes the EM functional [7], which can be written as

$$\max_{\boldsymbol{\theta}} E_{q(\mathbf{b})}\{\log P(\mathbf{z}_s, \mathbf{b}, \boldsymbol{\theta})\}. \qquad (13)$$

It is shown in Appendix A that the solution $\boldsymbol{\theta}^*$ to this optimization problem is given by the expression (17).

After the maximization step we may go back to the expectation step, for which a new iteration of turbo decoding is performed using the increasingly reliable pdf updated using (17) (see Figure 1). This procedure is continued until convergence.

In order to gain further insight from (17) we can consider using in that equation, instead of the soft values $q(b[k])$, the decisions $\hat{b}[k] = \operatorname{sign} L(b[k])$. With this choice the pmfs become deterministic: if $q(b[k] = +1) = 1$ then $q(b[k] = -1) = 0$, and vice versa. Interestingly, in this suboptimal case (17) becomes the normalized histogram of $\mathbf{z}_s$ on the bins $B_i$ defined in the Appendix, using the hard decisions $\hat{b}[k]$ to make the bin assignments of the corresponding $z_s[k]$. This decision-based approximation, which would be the intuitive way to update $\boldsymbol{\theta}$ in the EM iterative process (see [6]), achieves convergence in fewer steps, and generally to a good approximation of the real optimum.

Fig. 2. Gaussian noise. Performance of turbo-coded DC-DM with blind decoding for pdf models with different resolutions. (Plot: $P_b$ versus WNR; curves for $N_q = 4$, $8$, $16$, $32$.)

Last, there is partial information available for the initialization of $\boldsymbol{\theta}$, using the symbol-by-symbol hard decisions (2) that would be made if the received codeword were just considered as uncoded information. These hard decisions can be used to make the initial computation of (17), just as we have explained in the preceding simplification of the method.
Nevertheless, notice that with this approach only the values of $h(\boldsymbol{\theta}, z)$ corresponding to $|z| < \Delta/2$ can be initialized. All we can do in this initial iteration is to set the remaining values to a uniform non-zero value and normalize (8) so that it remains a pdf. These values cannot be initialized to zero, because such “impossible values” would unacceptably penalize the performance of the iterative decoding.

4. EXPERIMENTAL RESULTS

We present next some results of the tests carried out using turbo coding and the suboptimal intuitive updates of $\boldsymbol{\theta}$. We use the RSC (1 27/31), a pseudorandom interleaver with size $M = 1000$, and $\nu = 0.65$. First we show in Figure 2 the decoding performance of the proposed blind decoder against Gaussian noise, for a pdf model (8) consisting of $N_q$ kernel functions. One might think that the higher the number of kernel functions, the more accurate the estimate. In principle this is true, but as the resolution $N_q$ increases so does the variance of $\boldsymbol{\theta}$, and the estimated pdf eventually becomes too noisy to be useful for decoding, as we can see in the figure for values $N_q > 8$.

For non-Gaussian distortions the gain due to using a blind decoder instead of a Gaussian-matched one should be displayed. For a fair comparison we assume that the Gaussian-matched decoder estimates the noise power $\sigma_n^2$ using the expression

$$\hat{\sigma}_n^2 = \frac{1}{M} \sum_{k=1}^{M} \left[ Q_{\hat{b}[k]}(z[k]) - z[k] \right]^2 - (1-\nu)^2 \Delta^2/3,$$

and iteratively refines this estimate over successive decoding steps. In Figure 3 we show the performance obtained with this approach versus blind decoding when the attack is uniform i.i.d. noise. We also show the Gaussian-matched decoder with a non-iterative estimation of $\sigma_n^2$, stressing the importance of having a good estimate of the channel variance in order to correctly decode the turbo-coded information. We can see that the blind method is able to yield a gain over the less adaptive Gaussian-matched one.

Fig. 3. Uniform noise. Performance comparison of blind decoding versus Gaussian-matched decoding with noise variance estimation. (Plot: $P_b$ versus WNR; curves: blind with $N_q = 4$; Gaussian-matched with non-iterative $\hat{\sigma}_n^2$; Gaussian-matched with iterative $\hat{\sigma}_n^2$.)

5. REFERENCES

[1] Max H.M. Costa, “Writing on dirty paper,” IEEE Trans. on Information Theory, vol. 29, no. 3, pp. 439–441, May 1983.

[2] Brian Chen and Gregory W. Wornell, “Quantization index modulation: A class of provably good methods for digital watermarking and information embedding,” IEEE Trans. on Information Theory, vol. 47, no. 4, pp. 1423–1443, May 2001.

[3] M. Kesal, M. K. Mıhçak, R. Koetter, and P. Moulin, “Iteratively decodable codes for watermarking applications,” in Proc. 2nd Symposium on Turbo Codes and Their Applications, Brest, France, September 2000.

[4] J.J. Eggers, R. Bäuml, R. Tzschoppe, and B. Girod, “Scalar Costa scheme for information embedding,” IEEE Trans. on Signal Processing, vol. 51, no. 4, pp. 1003–1019, April 2003.

[5] J. Chou, S. Pradhan, and K. Ramchandran, “Turbo coded trellis-based constructions for data embedding: Channel coding with side information,” in Proc. of Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, October 2001.

[6] Yuan Li and Kwok H. Li, “Iterative PDF estimation and decoding for CDMA systems with non-Gaussian characterization,” IEE Electronics Letters, vol. 36, no. 8, pp. 730–731, April 2000.

[7] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

A. OPTIMAL UPDATE OF THE PARAMETERS

Assuming independence of the samples in $\mathbf{z}_s$ and $\mathbf{b}$ we can write (13) as

$$F(\boldsymbol{\theta}) \triangleq E_{q(\mathbf{b})}\{\log P(\mathbf{z}_s, \mathbf{b}, \boldsymbol{\theta})\} = \sum_{k=1}^{M} E_{q(\mathbf{b})}\{\log P(z_s[k], b[k], \boldsymbol{\theta})\}.$$

Using again the independence of the $b[k]$, we can write

$$F(\boldsymbol{\theta}) = \sum_{k=1}^{M} E_{q(b[k])}\{\log P(z_s[k], b[k], \boldsymbol{\theta})\} = \sum_{k=1}^{M} \sum_{b=\pm 1} q(b[k] = b) \log P(z_s[k], b[k] = b, \boldsymbol{\theta}). \qquad (14)$$

We will find it convenient next to rewrite (14) using some useful definitions. First, we define the intervals $B_i$ of the support set corresponding to the $i$-th kernel in (8), that is, $B_i \triangleq \left((i-1)\cdot\Delta_q - \Delta,\; i\cdot\Delta_q - \Delta\right]$, with $i = 1, \ldots, N_q$. Using them we can define in turn the sets of indices

$$\mathcal{P}_i^b \triangleq \{k \mid \tilde{z}_{s,b}[k] \in B_i\},$$

with $b = \pm 1$, $i = 1, \ldots, N_q$, and $\tilde{z}_{s,b}[k]$ the modularization (5) applied to $z_s[k]$. Now, (14) can be put as

$$F(\boldsymbol{\theta}) = \sum_{i=1}^{N_q} \sum_{b=\pm 1} \sum_{k \in \mathcal{P}_i^b} q(b[k] = b) \log \theta[i]. \qquad (15)$$

According to (13) we now have to maximize (15) with the restriction $\sum_{i=1}^{N_q} \theta[i] = 1$, which guarantees that (8) is a pdf. To this end, we build the Lagrangian

$$L(\boldsymbol{\theta}) = F(\boldsymbol{\theta}) - \gamma \left( \sum_{i=1}^{N_q} \theta[i] - 1 \right).$$

Differentiating with respect to $\theta[i]$, and equating to zero to obtain the extreme, we can write

$$\frac{\partial L(\boldsymbol{\theta})}{\partial \theta[i]} = \sum_{b=\pm 1} \sum_{k \in \mathcal{P}_i^b} q(b[k] = b)\, \frac{1}{\theta[i]} - \gamma = 0,$$

for $i = 1, \ldots, N_q$. The solution is a maximum because the second derivative is negative. In order to solve for the Lagrange multiplier $\gamma$ we just plug the solution of the equation above into the restriction, obtaining

$$\gamma = \sum_{i=1}^{N_q} \sum_{b=\pm 1} \sum_{k \in \mathcal{P}_i^b} q(b[k] = b). \qquad (16)$$

As $q(\mathbf{b})$ is a pmf, and as we are summing up in (16) the pmfs for every $b[k]$, we have that $\gamma = M$. Therefore, the optimal parameter vector $\boldsymbol{\theta}^*$ is given by the expression

$$\theta^*[i] = \frac{\sum_{b=\pm 1} \sum_{k \in \mathcal{P}_i^b} q(b[k] = b)}{M}, \qquad i = 1, \ldots, N_q. \qquad (17)$$
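The update (17) is a soft histogram: each systematic sample contributes its probability $q(b[k] = b)$ to the bin that contains its modular offset under hypothesis $b$. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, the modular offsets for both hypotheses are taken as precomputed inputs assumed to lie in $[-\Delta, \Delta]$, boundary values are clamped into the first or last bin, and the `hard` flag implements the decision-based simplification, in which (17) becomes a normalized histogram.

```python
import math


def llr_to_prob(llr):
    """q(b[k] = +1) from L(b[k]) = log(q(+1) / q(-1)), computed stably."""
    if llr >= 0:
        return 1.0 / (1.0 + math.exp(-llr))
    t = math.exp(llr)
    return t / (1.0 + t)


def m_step(off_plus, off_minus, q_plus, delta, n_q, hard=False):
    """Update (17). off_plus[k] and off_minus[k] are the modular offsets of
    z_s[k] under b = +1 and b = -1; q_plus[k] is q(b[k] = +1)."""
    dq = 2.0 * delta / n_q
    theta = [0.0] * n_q
    for zp, zm, qp in zip(off_plus, off_minus, q_plus):
        if hard:                        # replace the soft value by a decision
            qp = 1.0 if qp >= 0.5 else 0.0
        for z, q in ((zp, qp), (zm, 1.0 - qp)):
            # 0-based index of the right-closed bin B_i containing z,
            # clamped so that boundary offsets stay inside the table
            i = min(max(math.ceil((z + delta) / dq) - 1, 0), n_q - 1)
            theta[i] += q
    m = len(q_plus)                     # gamma = M in (16)
    return [t / m for t in theta]


# Illustrative use with three samples and N_q = 4 (made-up numbers).
theta = m_step([-0.9, 0.1, 0.6], [0.1, -0.9, -0.4], [0.8, 0.5, 1.0], 1.0, 4)
assert abs(sum(theta) - 1.0) < 1e-12   # the weights form a pmf, as required by (8)
```

In a full decoder the values `q_plus` would come from the BCJR output through `llr_to_prob`, and the returned weights would define the channel model used in the next turbo iteration; that loop is not shown here.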