Notes on computational linguistics.pdf - UCLA Department of ...
Stabler - Lx 185/209 2003
One indirect argument for stochastic models of this kind could come from the presentation of a theory
of human language acquisition based on stochastic grammars.
8.1.16 Codes<br />
(121) Shannon considers the information in a discrete, noiseless message. Here, the space of possible events
Ω_X is given by an alphabet (or “vocabulary”) Σ.
A fundamental result due to Shannon is that the entropy of the source sets a lower bound on the size
of the messages. We present this result in §129 below after setting the stage with the basic ideas we
need.
(122) Sayood (1996, p26) illustrates some basic points about codes with some examples. Consider:
message      code 1   code 2   code 3   code 4
a            0        0        0        0
b            0        1        10       01
c            1        00       110      011
d            10       11       111      0111
avg length   1.125    1.25     1.75     1.875
(The average lengths assume that a, b, c, d have probabilities 0.5, 0.25, 0.125, 0.125, as in Sayood's example.)
Notice that baa in code 2 is 100. But 100 is also the encoding of bc.
We might like to avoid this. Codes 3 and 4 have the nice property of unique decodability. That is, the
map from message sequences to code sequences is 1-1.
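The ambiguity can be checked mechanically. The following is a minimal sketch, not part of the original notes: the helper names `average_length` and `find_collision` are hypothetical, and the probabilities are an assumption (they are not stated in the table itself). The brute-force search only looks at short message sequences, so finding no collision is evidence of unique decodability, not a proof.

```python
from itertools import product

# Codes copied from the table above.
CODES = {
    "code 1": {"a": "0", "b": "0", "c": "1", "d": "10"},
    "code 2": {"a": "0", "b": "1", "c": "00", "d": "11"},
    "code 3": {"a": "0", "b": "10", "c": "110", "d": "111"},
    "code 4": {"a": "0", "b": "01", "c": "011", "d": "0111"},
}

# ASSUMPTION: message probabilities, not given in the table itself.
PROBS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}


def average_length(code, probs):
    """Expected number of bits per message under the given probabilities."""
    return sum(probs[m] * len(code[m]) for m in code)


def find_collision(code, max_len=4):
    """Return two distinct message sequences (length <= max_len) that have
    the same encoding, or None if no such pair exists up to that length."""
    seen = {}
    for n in range(1, max_len + 1):
        for msg in product(sorted(code), repeat=n):
            bits = "".join(code[m] for m in msg)
            if bits in seen and seen[bits] != msg:
                return seen[bits], msg
            seen.setdefault(bits, msg)
    return None


if __name__ == "__main__":
    for name, code in CODES.items():
        print(name, average_length(code, PROBS), find_collision(code))
```

For codes 1 and 2 the search returns a colliding pair almost immediately (in code 2, the single message c and the sequence aa both encode as 00), while for codes 3 and 4 it finds none among sequences of up to four messages.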
(123) Consider encoding the sequence
9 11 11 11 14 13 15 17 16 17 20 21
a. To transmit these numbers in binary code, we would need 5 bits per element.<br />
b. To transmit 9 different numbers: 9, 11, 13, 14, 15, 16, 17, 20, 21, we could hope for a somewhat better
code! 4 bits per element would be more than enough.
c. An even better idea: notice that the sequence is close to the function f(n) = n+8 for n ∈ {1, 2, ...}.
The perturbation or residual x_n − f(n) = 0, 1, 0, −1, 1, −1, 0, 1, −1, −1, 1, 1, so it suffices to transmit
the perturbation, which takes only three distinct values and so requires only two bits per element.
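The three options can be compared with a few lines of code. This is only an illustrative sketch: the names `model` and `bits_needed` are hypothetical, and it assumes a fixed-length binary code for each alternative.

```python
import math

# The sequence from (123).
seq = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]


def model(n):
    """The function f(n) = n + 8, for n = 1, 2, ..."""
    return n + 8


def bits_needed(num_values):
    """Bits per element for a fixed-length code over num_values symbols."""
    return max(1, math.ceil(math.log2(num_values)))


# Residuals x_n - f(n): what the sender transmits in option (c).
residuals = [x - model(n) for n, x in enumerate(seq, start=1)]

# The receiver knows f, so it recovers the sequence from the residuals.
decoded = [model(n) + r for n, r in enumerate(residuals, start=1)]

if __name__ == "__main__":
    print("a. plain binary:      ", bits_needed(max(seq) + 1), "bits/element")
    print("b. 9 distinct values: ", bits_needed(len(set(seq))), "bits/element")
    print("c. residuals only:    ", bits_needed(len(set(residuals))), "bits/element")
    print("round trip ok:", decoded == seq)
```

The saving in (c) depends on the receiver already knowing the model f; only the part of the sequence that the model fails to predict has to be sent.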
(124) Consider encoding the sequence
27 28 29 28 26 27 29 28 30 32 34 36 38<br />
This sequence does not look quite as regular as the previous one.
However, each value is near the previous one, so one strategy is to let your receiver know the starting
point and then send just the changes:
(27) 1 1 -1 -2 1 2 -1 2 2 2 2 2
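This "send the changes" strategy can be sketched as follows. The function names `delta_encode` and `delta_decode` are hypothetical, chosen only for this illustration.

```python
from itertools import accumulate


def delta_encode(values):
    """Return the starting point and the list of successive differences."""
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the sequence by summing the differences from the start."""
    return list(accumulate(deltas, initial=first))


if __name__ == "__main__":
    seq = [27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38]
    first, deltas = delta_encode(seq)
    print(first, deltas)
    print(delta_decode(first, deltas) == seq)
```

Here the differences take only four distinct values (-2, -1, 1, 2), so they could be sent with fewer bits per element than the original numbers.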
(125) Consider the following sequence of 41 elements, generated by a probabilistic source:
axbarayaranxarrayxranxfarxfaarxfaaarxaway
There are 8 symbols here, so we could use 3 bits per symbol.
On the other hand, we could use the following variable length code:<br />
a 1<br />
x 001<br />
b 01100<br />
f 0100<br />
n 0111<br />
r 000<br />
w 01101<br />
y 0101<br />
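To see what the variable length code buys, one can encode the sequence and count bits. The sketch below is an illustration only; the function names `encode` and `decode` are hypothetical. The decoder reads bits left to right and emits a symbol as soon as the bits read so far form a codeword, which works here because no codeword in the table is a prefix of another.

```python
# Variable length code from the table above.
CODE = {
    "a": "1", "x": "001", "b": "01100", "f": "0100",
    "n": "0111", "r": "000", "w": "01101", "y": "0101",
}
TEXT = "axbarayaranxarrayxranxfarxfaarxfaaarxaway"


def encode(text, code):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Decode a bit string, assuming no codeword is a prefix of another."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


if __name__ == "__main__":
    bits = encode(TEXT, CODE)
    print("fixed length:   ", 3 * len(TEXT), "bits")
    print("variable length:", len(bits), "bits")
    print("round trip ok:  ", decode(bits, CODE) == TEXT)
```

With this code the 41 symbols take 103 bits, compared with 123 bits at 3 bits per symbol, because the most frequent symbol a (16 occurrences) gets a one-bit codeword.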