2.1. Basic notions about source coding

Generalities

Let $A = \{a_1, \dots, a_n\}$ be the source alphabet, so that the source emits a stream of symbols from $A$. At this stage we assume that the source is memoryless, which means that if $X_i$ denotes the $i$-th symbol (so $X_i$ is a random variable with values in $A$), then

$p_j = P(X_i = a_j)$

does not depend on $i$, and $X_i$ is independent of all other symbols in the stream.

Let $C$ be a coding alphabet of $r$ symbols, for example the binary alphabet ($r = 2$). An encoding of $A$ is a map

$f: A \to C^*, \quad a_i \mapsto f(a_i),$

where $C^*$ is the set of finite strings over $C$. The elements of the list $C_f = \{f(a_1), \dots, f(a_n)\}$ will be called the codewords of the encoding $f$.
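As a small illustration of these definitions, the following Python sketch models a memoryless source and an encoding. The alphabet, the probabilities, the map `f` and the helper names `emit` and `encode` are all hypothetical choices made for this example, not taken from the text.

```python
import random

# Hypothetical source alphabet A and probabilities p_j (illustration only).
A = ["a", "b", "c"]
p = [0.5, 0.25, 0.25]

# Hypothetical binary encoding f: A -> C*, with C = {0, 1}.
f = {"a": "0", "b": "10", "c": "11"}

def emit(k, seed=None):
    """Draw k symbols independently from the same distribution (memoryless source)."""
    rng = random.Random(seed)
    return rng.choices(A, weights=p, k=k)

def encode(f, stream):
    """Encode a stream of symbols by concatenating their codewords."""
    return "".join(f[s] for s in stream)

# The list C_f of codewords of f.
codewords = [f[a] for a in A]
```

For instance, `encode(f, "abca")` concatenates the codewords of a, b, c, a in that order.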

2.2. Construction of prefix codes

Suppose $f: A \to C^*$ is an encoding. Let $C_j \subseteq C^j$, $j = 1, \dots, l$, be the set of codewords of length $j$, where $l$ denotes the maximum length of the codewords. Let $n_j = |C_j|$ (the number of elements of $C_j$). If $f$ is prefix, then we have

$n_1 + \cdots + n_l = n$ and $n_l \le r^l - n_1 r^{l-1} - \cdots - n_{l-1} r. \quad [*]$

The first relation is clear, as $C_1 \cup \cdots \cup C_l$ is the set of codewords and there are $n$ of these. As for the inequality, note that for each $j < l$ there are precisely $n_j r^{l-j}$ elements in $C^l$ having an element of $C_j$ as prefix (we can append any $l - j$ symbols to any of the $n_j$ elements of $C_j$), and that, since $f$ is prefix, these sets are disjoint for different $j$. Excluding all these elements, we are left with a subset of $C^l$ with $r^l - n_1 r^{l-1} - \cdots - n_{l-1} r$ elements that must contain $C_l$, and so $[*]$ is also clear.

Now it is interesting to see that the converse is also true.
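Dividing $[*]$ by $r^l$ shows that it is equivalent to $\sum_j n_j r^{-j} \le 1$. The following Python sketch checks $[*]$ and illustrates the converse with one standard construction (assign codewords in order of increasing length, counting upwards in base $r$). This is not necessarily the construction used in these notes; the function names are hypothetical, and digits are written as characters, so $r \le 10$ is assumed.

```python
def satisfies_star(counts, r=2):
    """Check [*] for counts = [n_1, ..., n_l]."""
    l = len(counts)
    used = sum(n_j * r ** (l - j) for j, n_j in enumerate(counts[:-1], start=1))
    return counts[-1] <= r ** l - used

def to_base(value, r, width):
    """Write value in base r using exactly `width` digits (assumes r <= 10)."""
    digits = []
    for _ in range(width):
        value, d = divmod(value, r)
        digits.append(str(d))
    return "".join(reversed(digits))

def prefix_code_from_lengths(lengths, r=2):
    """Build a prefix code with the given codeword lengths, if one exists."""
    top = max(lengths)
    # Integer form of the condition sum r^(-l_j) <= 1.
    if sum(r ** (top - l) for l in lengths) > r ** top:
        raise ValueError("lengths violate the inequality")
    words, value, prev = [], 0, 0
    for l in sorted(lengths):
        value *= r ** (l - prev)      # extend the counter to the new length
        words.append(to_base(value, r, l))
        value += 1                    # next unused string of this length
        prev = l
    return words

def is_prefix_free(words):
    """True if no word is a prefix of another word in the list."""
    return not any(i != j and v.startswith(u)
                   for i, u in enumerate(words)
                   for j, v in enumerate(words))
```

The codewords are returned in order of increasing length, so the caller has to match them to symbols with the corresponding lengths.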

To prove the second part, let $l_j$ be the least integer such that $r^{l_j} \ge 1/p_j$. Then $p_j \ge r^{-l_j}$ and $\sum_j r^{-l_j} \le \sum_j p_j = 1$. Thus the $l_j$ satisfy the Kraft-McMillan inequality, and hence there is a prefix (sic) encoding $f: A \to C^*$ with code lengths $l_1, \dots, l_n$. But $l_j < 1 - \log_2 p_j / \log_2(r)$, by E.2.2, and hence

$\bar{l} = \sum_j p_j l_j < \sum_j p_j \left(1 - \log_2 p_j / \log_2(r)\right) = 1 + H/\log_2(r).$

E.2.2. Show that the least integer $l_j$ such that $r^{l_j} \ge 1/p_j$ is $\lceil -\log_2 p_j / \log_2(r) \rceil$. In particular, $l_j < 1 - \log_2 p_j / \log_2(r)$.

E.2.3. Use the Kraft-McMillan inequality to show that the set

$C = \{0, 01, 11, 101\}$

is not a code, and find a binary string that can be decomposed in two different ways as a concatenation of elements of $C$.
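The choice of the $l_j$ in this proof can be tried out numerically. The sketch below computes the least integers $l_j$ with $r^{l_j} \ge 1/p_j$ by direct search (to avoid rounding issues with logarithms) and evaluates both sides of the bound. The function names and the sample probabilities are hypothetical, and all $p_j$ are assumed to lie strictly between 0 and 1.

```python
from math import log2

def shannon_lengths(probs, r=2):
    """Least integers l_j with r**l_j >= 1/p_j, assuming 0 < p_j < 1."""
    lengths = []
    for p in probs:
        l = 0
        while r ** l * p < 1:
            l += 1
        lengths.append(l)
    return lengths

def entropy(probs):
    """Entropy H in bits of a distribution with positive probabilities."""
    return -sum(p * log2(p) for p in probs)

def average_length(probs, lengths):
    """Average codeword length: sum of p_j * l_j."""
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.2, 0.4, 0.2, 0.1, 0.1]         # sample distribution (illustration)
lengths = shannon_lengths(probs)           # binary case, r = 2
kraft = sum(2 ** -l for l in lengths)      # left side of the Kraft-McMillan inequality
lbar = average_length(probs, lengths)
H = entropy(probs)
```

With these values one can check that `kraft <= 1` and `H <= lbar < 1 + H`, as the argument predicts for $r = 2$.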

2.4. Optimal encodings and Huffman's algorithm

An encoding $f: A \to C^*$ is said to be optimal, or compact, if $\bar{l}$ has the minimum possible value. In the search for such encodings, we know that it is enough to look for prefix encodings. Moreover, the minimum value of $\bar{l}$ must satisfy the constraints given by Shannon's source coding theorem:

$H/\log_2(r) \le \bar{l} < 1 + H/\log_2(r).$

Optimal prefix binary encodings $f: A \to \{0,1\}^*$ (Huffman encodings) were constructed by Huffman in 1952. The purpose of this section is to describe his procedure in detail.

We will first explain how the algorithm works in two examples. Given the source symbols $(a_1, \dots, a_n)$ and their probabilities $(p_1, \dots, p_n)$, we will build a binary tree of depth at most $n - 1$ having $a_1, \dots, a_n$ as leaves. The left (right) edges are labeled 0 (1). The encoding of a symbol is obtained by reading the labels found in going from the root to the leaf of that symbol.

Example II (cf. [Blelloch-2010], p. 15-16).

We take again the alphabet {a, b, c, d, e}, with probabilities {0.2, 0.4, 0.2, 0.1, 0.1}. In this case, Huffman's algorithm constructs the tree shown in the figure, and hence it outputs the encoding

a → 10, b → 00, c → 11, d → 010, e → 011.

[Figure: Huffman: Example II]

The smallest probabilities are p(d) = p(e) = 0.1. We proceed with them as in the previous example, yielding the node 0.2, which ties with a and c. Since these were known before, we combine a and c to create the node 0.4, which ties with b. Now we combine 0.4 = p(b), known before, with the node 0.2, which is now the smallest value, yielding the node 0.6. Finally, the node 0.6 is combined with the remaining node 0.4 to form the root.
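The effect of the tie-breaking in Example II can be checked numerically. The sketch below computes the average length $\sum_j p_j l_j$ and the variance $\sum_j p_j (l_j - \bar{l})^2$ of the codeword lengths for the encoding above, and for a second set of lengths that arises if the ties are resolved differently (merging the new node 0.2 first with c and then with a). That alternative is an illustration added here, not part of the example in the text, and the function names are hypothetical.

```python
probs = {"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1}

# Encoding of Example II, as given in the text.
code = {"a": "10", "b": "00", "c": "11", "d": "010", "e": "011"}
lengths = {s: len(w) for s, w in code.items()}

# Hypothetical alternative: lengths obtained with a different tie-breaking.
alt_lengths = {"a": 2, "b": 1, "c": 3, "d": 4, "e": 4}

def average_length(probs, lengths):
    """Average codeword length: sum of p_j * l_j."""
    return sum(probs[s] * lengths[s] for s in probs)

def length_variance(probs, lengths):
    """Variance of the codeword lengths: sum of p_j * (l_j - lbar)**2."""
    lbar = average_length(probs, lengths)
    return sum(probs[s] * (lengths[s] - lbar) ** 2 for s in probs)

avg, var = average_length(probs, lengths), length_variance(probs, lengths)
alt_avg, alt_var = average_length(probs, alt_lengths), length_variance(probs, alt_lengths)
```

Both sets of lengths give the same average length, so both are optimal, but the variances differ considerably.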

Remark. If we do not obey the precedence rules explained in the examples, we still get an optimal prefix code. With the precedence rules, it turns out that the lengths of the codewords have the lowest possible variance. N4

Algorithm (Huffman). Input: the symbol probabilities. Output: a binary tree with nodes labeled with numbers in the range (0,1] and edges labeled with 0 or 1.

Description.
Start. Consider each symbol as a depth 0 tree labeled with its probability.
Repeat. Do until a single tree remains (which will happen after n − 1 steps):
- select two trees with the lowest weight roots (say w and w′).
- combine the two trees into a single tree by adding a new root with weight w + w′, and making the two trees its children.

For a proof of the correctness of the algorithm, including the strengthening explained in the note, see [Blelloch-2010].
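The steps of the algorithm can be sketched in Python with a priority queue. Two choices below are assumptions rather than statements from the text: ties between equal weights are broken in favor of the tree created earlier (one reading of the precedence rule in the examples), and the heavier of the two selected trees is placed on the left edge (label 0), which happens to reproduce the labels of Example II. The function name `huffman` is hypothetical, and symbols are assumed to be strings.

```python
import heapq
from itertools import count

def huffman(probabilities):
    """Return a dict symbol -> binary codeword for a dict symbol -> probability."""
    order = count()
    # Heap entries are (weight, creation order, tree); a tree is either a
    # symbol (leaf) or a pair (left, right). Ties in weight are resolved
    # by creation order, so older trees are selected first (assumption).
    heap = [(p, next(order), s) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # Assumption: the heavier subtree goes on the left (edge 0).
        left, right = (t2, t1) if w2 > w1 else (t1, t2)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))

    code = {}

    def walk(tree, prefix):
        """Read the edge labels from the root down to each leaf."""
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"   # a one-symbol alphabet gets "0"

    walk(heap[0][2], "")
    return code

example_2 = huffman({"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1})
```

With floating-point weights, ties are only detected when the sums are exactly equal, as they are in this example; exact rational weights (for instance `fractions.Fraction`) would avoid that caveat.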

Example III (cf. [MacKay-2003])

Data for monogram English, with $L_j = \log_2(1/p_j)$.

a_j   p_j      L_j   l_j   f(a_j)
a     0.0575    4.1    4   0000
b     0.0128    6.3    6   001000
c     0.0263    5.2    5   00101
d     0.0285    5.1    5   10000
e     0.0913    3.5    4   1100
f     0.0173    5.9    6   111000
g     0.0133    6.2    6   001001
h     0.0313    5.0    5   10001
i     0.0599    4.1    4   1001
j     0.0006   10.7   10   1101000000
k     0.0084    6.9    7   1010000
l     0.0335    4.9    5   11101
m     0.0235    5.4    6   110101
n     0.0596    4.1    4   0001
o     0.0689    3.9    4   1011
p     0.0192    5.7    6   111001
q     0.0008   10.3    9   110100001
r     0.0508    4.3    5   11011
s     0.0567    4.1    4   0011
t     0.0706    3.8    4   1111
u     0.0334    4.9    5   10101
v     0.0069    7.2    8   11010001
w     0.0119    6.4    7   1101001
x     0.0073    7.1    7   1010001
y     0.0164    5.9    6   101001
z     0.0007   10.4   10   1101000001
_     0.1928    2.4    2   01

[Figure: Huffman code for monogram English]

Remark. The entropy of monogram English is H = 4.1094, while the average codeword length is 4.1462 ≥ H (Example entropy-english1.cc). N5
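The figures in this remark can be recomputed from the table of Example III. The following Python sketch is independent of the entropy-english1.cc example mentioned above; it copies the probabilities and codewords from the table, and the variable names are hypothetical.

```python
from fractions import Fraction
from math import log2

# Symbol, probability p_j and codeword f(a_j), copied from the table of Example III.
TABLE = """
a 0.0575 0000
b 0.0128 001000
c 0.0263 00101
d 0.0285 10000
e 0.0913 1100
f 0.0173 111000
g 0.0133 001001
h 0.0313 10001
i 0.0599 1001
j 0.0006 1101000000
k 0.0084 1010000
l 0.0335 11101
m 0.0235 110101
n 0.0596 0001
o 0.0689 1011
p 0.0192 111001
q 0.0008 110100001
r 0.0508 11011
s 0.0567 0011
t 0.0706 1111
u 0.0334 10101
v 0.0069 11010001
w 0.0119 1101001
x 0.0073 1010001
y 0.0164 101001
z 0.0007 1101000001
_ 0.1928 01
"""

rows = [line.split() for line in TABLE.strip().splitlines()]
prob = {s: float(q) for s, q, _ in rows}
code = {s: w for s, _, w in rows}

# Average codeword length (the text reports 4.1462).
avg = sum(prob[s] * len(code[s]) for s in prob)

# Entropy in bits (the text reports 4.1094).
H = -sum(q * log2(q) for q in prob.values())

# Kraft sum of the codeword lengths, computed exactly.
kraft = sum(Fraction(1, 2 ** len(w)) for w in code.values())
```

The tabulated probabilities are rounded to four decimals, so the recomputed entropy may differ slightly from the quoted value in the last digits.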

Notes

N1 (p. 3). If $f(a_{i_1}) f(a_{i_2}) \cdots f(a_{i_k}) = f(a_{i'_1}) f(a_{i'_2}) \cdots f(a_{i'_{k'}})$, then either $f(a_{i_1})$ is a proper prefix of $f(a_{i'_1})$, or $f(a_{i'_1})$ is a proper prefix of $f(a_{i_1})$, or $f(a_{i_1}) = f(a_{i'_1})$. If the code is instantaneous, only the last possibility can occur, and so $a_{i_1} = a_{i'_1}$. Now the claim follows easily by induction.

N2 (p. 3). Work backwards from the end of a given binary string. If it ends with 0, this 0 is uniquely decoded as x and we can proceed by induction. If it ends with 1 and the previous bit is 0, the last two bits are uniquely decoded as y, and we can again proceed by induction. If the last two bits are 11, then the binary string is not decodable.

N3 (p. 7). Let $f: A \to C^*$ be an encoding with word lengths $l_1, \dots, l_n$, and let $l$ be the maximum of these lengths. Then we can write, for any positive integer $s$,

$\left(\frac{1}{r^{l_1}} + \cdots + \frac{1}{r^{l_n}}\right)^s = \frac{\nu_1}{r} + \frac{\nu_2}{r^2} + \cdots + \frac{\nu_{ls}}{r^{ls}},$

where $\nu_j \ge 0$ is the number of sequences $(i_1, \dots, i_s)$ with $l_{i_1} + \cdots + l_{i_s} = j$, that is, the number of ways in which a string in $C^j$ can be obtained by concatenating $s$ codewords. If the code is ud, then we clearly must have $\nu_j \le r^j$, and hence

$\left(\frac{1}{r^{l_1}} + \cdots + \frac{1}{r^{l_n}}\right)^s \le ls,$ or $\frac{1}{r^{l_1}} + \cdots + \frac{1}{r^{l_n}} \le l^{1/s} s^{1/s},$

and letting $s$ go to $\infty$ we get Kraft's inequality, since $l^{1/s} s^{1/s} \to 1$.

N4 (p. 14). The variance of the code lengths is $\sum_j p_j (l_j - \bar{l})^2$. Remember that $\bar{l} = \sum_j p_j l_j$. In the example it is equal to 1.7895.

N5 (p. 15). Here is a listing of the overloaded function entropy (in cc):

entropy(p:Real) check (0 […]

entropy(P:Vector|List):=
begin
  local S=0
  for p in P do
    if (p>1)|(p […]

References

[Blelloch-2010] Guy E. Blelloch: Introduction to data compression. http://www.cs.cmu.edu/~guyb/realworld/compression.pdf.

[MacKay-2003] David J.C. MacKay: Information theory, inference and learning algorithms. Cambridge University Press, 2003. xii+628 p.