20.07.2013 Views

Notes on computational linguistics.pdf - UCLA Department of ...



Stabler - Lx 185/209 2003

One indirect argument for stochastic models of this kind could come from the presentation of a theory of human language acquisition based on stochastic grammars.

8.1.16 Codes

(121) Shannon considers the information in a discrete, noiseless message. Here, the space of possible events Ω_X is given by an alphabet (or “vocabulary”) Σ.

A fundamental result, due to Shannon, is that the entropy of the source sets a lower bound on the size of the messages. We present this result in §129 below after setting the stage with the basic ideas we need.

(122) Sayood (1996, p26) illustrates some basic points about codes with some examples. Consider:

message     code 1  code 2  code 3  code 4
a           0       0       0       0
b           0       1       10      01
c           1       00      110     011
d           10      11      111     0111
avg length  1.125   1.25    1.75    1.875
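The averages in the last row depend on the message probabilities, which are not given here. The sketch below assumes probabilities 1/2, 1/4, 1/8, 1/8 for a, b, c, d; this is an assumption, chosen because it reproduces the averages listed for codes 1, 3 and 4. Under that assumption the computed averages are 1.125, 1.25, 1.75 and 1.875. The names `probs` and `average_length` are illustrative only.

```python
from fractions import Fraction

# ASSUMPTION: these probabilities are not stated in the notes. They are chosen
# because they reproduce the averages listed for codes 1, 3 and 4.
probs = {
    "a": Fraction(1, 2),
    "b": Fraction(1, 4),
    "c": Fraction(1, 8),
    "d": Fraction(1, 8),
}

codes = {
    "code 1": {"a": "0", "b": "0", "c": "1", "d": "10"},
    "code 2": {"a": "0", "b": "1", "c": "00", "d": "11"},
    "code 3": {"a": "0", "b": "10", "c": "110", "d": "111"},
    "code 4": {"a": "0", "b": "01", "c": "011", "d": "0111"},
}


def average_length(code, probs):
    """Expected number of bits per message under the given probabilities."""
    return sum(probs[m] * len(word) for m, word in code.items())


averages = {name: float(average_length(code, probs)) for name, code in codes.items()}
for name, avg in averages.items():
    print(name, avg)
```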

Notice that baa in code 2 is 100. But 100 is also the encoding of bc.

We might like to avoid this. Codes 3 and 4 have the nice property of unique decodability. That is, the map from message sequences to code sequences is 1-1.
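The ambiguity can be found mechanically. The following is a minimal brute-force sketch, not a general decision procedure: it encodes every message sequence up to a small length and reports the first two sequences that share an encoding. Finding no collision up to that length is only evidence of unique decodability, not a proof. The function name `find_collision` and the bound `max_len` are illustrative assumptions.

```python
from itertools import product

codes = {
    "code 1": {"a": "0", "b": "0", "c": "1", "d": "10"},
    "code 2": {"a": "0", "b": "1", "c": "00", "d": "11"},
    "code 3": {"a": "0", "b": "10", "c": "110", "d": "111"},
    "code 4": {"a": "0", "b": "01", "c": "011", "d": "0111"},
}


def encode(message, code):
    """Concatenate the codewords of the messages in the sequence."""
    return "".join(code[m] for m in message)


def find_collision(code, max_len=4):
    """Return two different message sequences with the same encoding, or None.

    Only sequences of up to max_len messages are tried, so a result of None
    is evidence of unique decodability, not a proof.
    """
    seen = {}
    for n in range(1, max_len + 1):
        for msg in product(sorted(code), repeat=n):
            bits = encode(msg, code)
            if bits in seen:
                return "".join(seen[bits]), "".join(msg)
            seen[bits] = msg
    return None


for name, code in codes.items():
    print(name, find_collision(code))
```

Codes 1 and 2 yield a collision almost immediately, while codes 3 and 4 yield none within the bound tried.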

(123) Consider encoding the sequence

9 11 11 11 14 13 15 17 16 17 20 21

a. To transmit these numbers in binary code, we would need 5 bits per element.<br />

b. To transmit 9 different values: 9, 11, 13, 14, 15, 16, 17, 20, 21, we could hope for a somewhat better code! 4 bits would be more than enough.

c. An even better idea: notice that the sequence is close to the function f(n) = n + 8 for n ∈ {1, 2, ...}. The perturbation or residual X_n − f(n) = 0, 1, 0, −1, 1, −1, 0, 1, −1, −1, 1, 1, so it suffices to transmit the perturbation, which only requires two bits per element.
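The three bit counts in (a) to (c) can be checked with a short sketch. It reads the sequence as the twelve values 9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21, which is the reading under which the listed residuals come out. It also assumes the receiver already knows the model f(n) = n + 8. The variable names are illustrative.

```python
from math import ceil, log2

xs = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]


def f(n):
    """Model shared by sender and receiver (assumed known to both)."""
    return n + 8


# (a) plain binary: enough bits to write the largest value
bits_plain = ceil(log2(max(xs) + 1))

# (b) one fixed-length index per distinct value
bits_distinct = ceil(log2(len(set(xs))))

# (c) transmit only the residual X_n - f(n), with n starting at 1
residuals = [x - f(n) for n, x in enumerate(xs, start=1)]
bits_residual = ceil(log2(len(set(residuals))))

# The receiver adds the residuals back onto the model to recover the data.
decoded = [f(n) + r for n, r in enumerate(residuals, start=1)]

print(bits_plain, bits_distinct, bits_residual)
print(residuals)
```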

(124) Consider encoding the sequence

27 28 29 28 26 27 29 28 30 32 34 36 38<br />

This sequence does not look quite so regular as the previous case.<br />

However, each value is near the previous one, so one strategy is to let your receiver know the starting point and then send just the changes:

(27) 1 1 -1 -2 1 2 -1 2 2 2 2 2
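This strategy is a form of difference (delta) coding. A minimal sketch of both directions follows, assuming the receiver is sent the starting value and then the list of changes. The variable names are illustrative.

```python
from itertools import accumulate

xs = [27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38]

# Sender: the starting point, then each value minus its predecessor.
start = xs[0]
deltas = [b - a for a, b in zip(xs, xs[1:])]

# Receiver: running sum of the changes, beginning at the starting point.
decoded = list(accumulate(deltas, initial=start))

print(start, deltas)
print(decoded == xs)
```

The changes take only a handful of distinct values, so they can be sent with fewer bits per element than the raw values.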

(125) Consider the following sequence of 41 elements, generated by a probabilistic source:

axbarayaranxarrayxranxfarxfaarxfaaarxaway<br />

There are 8 symbols here, so we could use 3 bits per symbol.

On the other hand, we could use the following variable length code:<br />

a 1<br />

x 001<br />

b 01100<br />

f 0100<br />

n 0111<br />

r 000<br />

w 01101<br />

y 0101<br />
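To compare the two options, the sketch below encodes the sequence with the variable length code above, counts the bits, and decodes the result again. The decoder reads bits until they match a codeword, which works here because no codeword in this table is a prefix of another. The function names `encode` and `decode` are illustrative, not taken from the notes.

```python
code = {
    "a": "1",
    "x": "001",
    "b": "01100",
    "f": "0100",
    "n": "0111",
    "r": "000",
    "w": "01101",
    "y": "0101",
}

text = "axbarayaranxarrayxranxfarxfaarxfaaarxaway"


def encode(text, code):
    """Concatenate the codeword of each symbol."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Greedy decoding; relies on no codeword being a prefix of another."""
    inverse = {word: ch for ch, word in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


bits = encode(text, code)
fixed_bits = 3 * len(text)  # fixed-length alternative: 3 bits per symbol

print(len(text), "symbols")
print("fixed length:", fixed_bits, "bits")
print("variable length:", len(bits), "bits")
print("round trip ok:", decode(bits, code) == text)
```

The saving comes from giving the shortest codeword to a, by far the most frequent symbol in the sequence.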
