Notes on computational linguistics.pdf - UCLA Department of ...
Stabler - Lx 185/209 2003

Entropy
(112) Often we are interested not in the information conveyed by a particular event, but in the information conveyed by an information source:

…from the point of view of engineering, a communication system must face the problem of handling any message that the source can produce. If it is not possible or practicable to design a system which can handle everything perfectly, then the system should handle well the jobs it is most likely to be asked to do, and should resign itself to be less efficient for the rare task. This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce. And information, as used in communication theory, does just this. (Weaver, 1949, p14)
(113) For a source X, the average information of an arbitrary outcome in \Omega_X is

    H = \sum_{A \in \Omega_X} P(A)\, i(A) = -\sum_{A \in \Omega_X} P(A) \log P(A)
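The definition of H translates directly into code; a minimal Python sketch (the function name `entropy` is ours, not from the text):

```python
import math

def entropy(probs):
    """Average information H = -sum of p * log2(p), in bits per event.
    Terms with p == 0 are skipped, since p log p tends to 0 as p -> 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit per toss; a biased coin carries less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # roughly 0.469
```

The base-2 logarithm gives the answer in bits, matching the "average number of bits per event" reading below.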
(114) This is sometimes called the "entropy" of the random variable – the average number of bits per event (Charniak, 1993, p29). So called because each P(A) gives us the "proportion" of times that A occurs. For a source X of an infinite sequence of events X_1, X_2, ..., the entropy or average information of the source is usually given as the limit

    H(X) = \lim_{n \to \infty} \frac{G_n}{n}

where

    G_n = -\sum_{A_1 \in \Omega_X} \sum_{A_2 \in \Omega_X} \cdots \sum_{A_n \in \Omega_X} P(X_1=A_1, X_2=A_2, \ldots, X_n=A_n) \log P(X_1=A_1, X_2=A_2, \ldots, X_n=A_n)
(115) When the space \Omega_X consists of independent time-invariant events whose union has probability 1, then

    G_n = -n \sum_{A \in \Omega_X} P(A) \log P(A),

and so the entropy or average information of the source is

    H(X) = \sum_{A \in \Omega_X} P(A)\, i(A) = -\sum_{A \in \Omega_X} P(A) \log P(A)

Charniak (1993, p29) calls this the per word entropy of the process.
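Per word entropy can be estimated from relative frequencies in a sample; a sketch under the simplifying assumption that each word is an independent draw from one fixed distribution (the sample sentence and the helper name are ours):

```python
import math
from collections import Counter

def per_word_entropy(words):
    """Estimate -sum of P(w) * log2 P(w) using relative frequencies
    as the probabilities, treating word tokens as independent draws."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the cat saw the dog and the dog saw the cat".split()
print(per_word_entropy(text))
```

With 5 distinct words the estimate is bounded above by log2(5) ≈ 2.32 bits, and it falls below that bound because "the" is more frequent than the rest.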
(116) If we use some measure other than bits, a measure that allows r-ary decisions rather than just binary ones, then we can define H_r(X) similarly, except that we use \log_r rather than \log_2.

(117) Shannon shows that this measure of information has the following intuitive properties (as discussed
also in the review of this result in Miller and Chomsky (1963, pp432ff)):

a. Adding any number of impossible events to \Omega_X does not change H(X).
b. H(X) is a maximum when all the events in \Omega_X are equiprobable. (See the last graph on page 154.)
c. H(X) is additive, in the sense that H(X_i \cup X_j) = H(X_i) + H(X_j) when X_i and X_j are independent.
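Properties (a)–(c) can be verified numerically; a small sketch with our own helper `H` (the example distributions are arbitrary):

```python
import math

def H(probs):
    """Entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# (a) adding impossible events does not change H(X)
assert H([0.5, 0.3, 0.2]) == H([0.5, 0.3, 0.2, 0.0, 0.0])

# (b) the uniform distribution maximizes H for a fixed number of events
assert H([0.25] * 4) >= H([0.7, 0.1, 0.1, 0.1])

# (c) additivity: for independent sources, joint entropy is the sum
px, py = [0.5, 0.5], [0.9, 0.1]
joint = [p * q for p in px for q in py]
assert abs(H(joint) - (H(px) + H(py))) < 1e-12

print("all three properties check out")
```

The same helper also illustrates (116): changing the base of the logarithm rescales the result, since \log_r x = \log_2 x / \log_2 r.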