
Stabler - Lx 185/209 2003

Entropy

(112) Often we are interested not in the information conveyed by a particular event, but in the information conveyed by an information source:

…from the point of view of engineering, a communication system must face the problem of handling any message that the source can produce. If it is not possible or practicable to design a system which can handle everything perfectly, then the system should handle well the jobs it is most likely to be asked to do, and should resign itself to be less efficient for the rare task. This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce. And information, as used in communication theory, does just this. (Weaver, 1949, p14)

(113) For a source X, the average information of an arbitrary outcome in ΩX is

\[ H = \sum_{A \in \Omega_X} P(A)\, i(A) = -\sum_{A \in \Omega_X} P(A) \log P(A) \]
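A minimal sketch of this definition, assuming a finite ΩX represented as a Python dictionary from outcomes to probabilities (the distributions and the function name here are illustrative, not from the notes):

```python
import math

def entropy(dist):
    """H = -sum_A P(A) log2 P(A): average information in bits per event."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A fair coin carries one bit per toss; a biased coin carries less.
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))   # approximately 0.469
```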

(114) This is sometimes called the "entropy" of the random variable – the average number of bits per event (Charniak, 1993, p29), so called because each P(A) gives us the "proportion" of times that A occurs. For a source X producing an infinite sequence of events X1, X2, ..., the entropy or average information of the source is usually given as the limiting average of the information carried by longer and longer sequences:

\[ H(X) = \lim_{n \to \infty} \frac{G_n}{n} \]

where

\[ G_n = -\sum_{A_1 \in \Omega_X} \sum_{A_2 \in \Omega_X} \cdots \sum_{A_n \in \Omega_X} P(X_1{=}A_1, X_2{=}A_2, \ldots, X_n{=}A_n) \log P(X_1{=}A_1, X_2{=}A_2, \ldots, X_n{=}A_n) \]
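As a rough numerical sketch of this limit (the two-symbol Markov source below is invented for illustration and is not from the notes): G_n is computed by enumerating all length-n sequences, and G_n/n settles toward the source's per-symbol entropy.

```python
import itertools, math

start = {"a": 0.5, "b": 0.5}                      # P(X1 = A)
trans = {"a": {"a": 0.9, "b": 0.1},               # P(X_{t+1} = B | X_t = A)
         "b": {"a": 0.1, "b": 0.9}}

def seq_prob(seq):
    """Probability of one particular sequence A1,...,An."""
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

def G(n):
    """Joint information of (X1,...,Xn): -sum over sequences of P log2 P."""
    total = 0.0
    for seq in itertools.product("ab", repeat=n):
        p = seq_prob(seq)
        if p > 0:
            total -= p * math.log2(p)
    return total

for n in (1, 2, 4, 8):
    print(n, G(n) / n)    # decreases toward about 0.47 bits per symbol
```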

(115) When the space ΩX consists of independent time-invariant events whose union has probability 1, then

\[ G_n = -n \sum_{A \in \Omega_X} P(A) \log P(A), \]

and so the entropy or average information of the source can be given in the following way:

\[ H(X) = \sum_{A \in \Omega_X} P(A)\, i(A) = -\sum_{A \in \Omega_X} P(A) \log P(A) \]

Charniak (1993, p29) calls this the per word entropy of the process.
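A quick check of this special case, assuming independent, identically distributed events (the word probabilities below are made up for illustration): since P(A1,...,An) factors into a product, G_n comes out as exactly n times the single-event entropy, so G_n/n equals H(X) for every n.

```python
import itertools, math

dist = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # illustrative word probabilities
H = -sum(p * math.log2(p) for p in dist.values())

def G(n):
    total = 0.0
    for probs in itertools.product(dist.values(), repeat=n):
        p = math.prod(probs)       # independence: P(A1,...,An) = P(A1)...P(An)
        total -= p * math.log2(p)
    return total

for n in (1, 2, 3):
    print(n, G(n) / n, H)          # G_n / n and H agree for every n
```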

(116) If we use some measure other than bits, a measure that allows r-ary decisions rather than just binary ones, then we can define H_r(X) similarly except that we use log_r rather than log_2.

(117) Shannon shows that this measure of information has the following intuitive properties (as discussed also in the review of this result in Miller and Chomsky (1963, pp432ff)):

a. Adding any number of impossible events to ΩX does not change H(X).

b. H(X) is a maximum when all the events in ΩX are equiprobable (see the last graph on page 154).

c. H(X) is additive, in the sense that H(Xi ∪ Xj) = H(Xi) + H(Xj) when Xi and Xj are independent.
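These properties, along with the change of base behind H_r in (116), can be checked numerically; the distributions below are made up for illustration.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

X = {"a": 0.7, "b": 0.2, "c": 0.1}
Y = {"x": 0.5, "y": 0.5}

# (a) adding impossible events leaves H(X) unchanged
print(entropy(X), entropy({**X, "d": 0.0, "e": 0.0}))

# (b) over three events, the uniform distribution maximizes H: log2(3) = 1.585...
print(entropy({"a": 1/3, "b": 1/3, "c": 1/3}), entropy(X))

# (c) additivity: for independent X and Y, H of the joint space = H(X) + H(Y)
joint = {(a, b): X[a] * Y[b] for a in X for b in Y}
print(entropy(joint), entropy(X) + entropy(Y))

# (116): an r-ary measure is just a change of base, H_r(X) = H(X) / log2(r)
print(entropy(X) / math.log2(3))   # entropy of X measured in 3-ary decisions
```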
