
December 11-16 2016 Osaka Japan

W16-46



Algorithm 1 Online Sentence Segmenter
Require: w_0, w_1, w_2, ...
1: W ← []; S ← []
2: for w_k in stream of words do
3:     W ← W + [w_k]    ⊲ assume W = [w_0, w_1, ..., w_{k−1}, w_k]
4:     s_{k−1} ← confidence of segmenting before w_k
5:     S ← S + [s_{k−1}]    ⊲ assume S = [s_0, s_1, ..., s_{k−1}]
6:     B ← apply segmentation strategy to S    ⊲ assume B = [b_0, b_1, ..., b_{k−1}]
7:     if b_i = 1 (0 ≤ i ≤ k − 1) then
8:         output [w_0, w_1, ..., w_i] as a segment
9:         remove first i + 1 elements from W and S
10:     end if
11: end for

segmentation strategy outputs no boundary, no action is taken (represented by (c) in Figure 2). If it outputs a boundary, a segment will be output and the inner sequence will be updated accordingly (as in the process represented by (d) in Figure 2).
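The loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `online_segmenter`, `toy_confidence` and `threshold_strategy` are hypothetical, the confidence function is a stand-in that only looks for a full stop, and the fixed threshold is just one possible segmentation strategy. The sketch skips the score for the very first buffered word (there is nothing before it to segment) and drops the i + 1 words it has just emitted, so that the buffer restarts at the word after the boundary.

```python
from typing import Callable, Iterable, Iterator, List


def online_segmenter(
    words: Iterable[str],
    confidence: Callable[[List[str], int], float],
    strategy: Callable[[List[float]], List[int]],
) -> Iterator[List[str]]:
    """Yield a segment as soon as the strategy marks a boundary."""
    W: List[str] = []    # buffered words that have not been emitted yet
    S: List[float] = []  # S[j] = confidence of a boundary after W[j]
    for w in words:
        W.append(w)
        if len(W) >= 2:
            # Confidence of segmenting before the new word, i.e. after W[-2].
            S.append(confidence(W, len(W) - 2))
        B = strategy(S)  # one 0/1 decision per score in S
        if 1 in B:
            i = B.index(1)   # first position marked as a boundary
            yield W[: i + 1]
            del W[: i + 1]   # keep only the words after the boundary
            del S[: i + 1]


def toy_confidence(buffer: List[str], j: int) -> float:
    """Hypothetical stand-in score: high only after a full stop."""
    return 1.0 if buffer[j] == "." else 0.0


def threshold_strategy(scores: List[float], theta: float = 0.5) -> List[int]:
    """Assumed strategy: mark every position whose score exceeds theta."""
    return [1 if s > theta else 0 for s in scores]


stream = ["hello", "world", ".", "good", "morning", ".", "bye"]
segments = list(online_segmenter(stream, toy_confidence, threshold_strategy))
print(segments)
```

Note that a boundary after a word can only be emitted once the following word has arrived, and that words left in the buffer when the stream ends (here `"bye"`) are not flushed, since Algorithm 1 as listed has no end-of-stream step.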

The following two subsections describe the boundary confidence scores and segmentation strategies in detail, respectively.

3.1 Segment Boundary Confidence Score

This confidence score is based on an N-gram language model. Suppose the language model order is $n$. The confidence score represents the plausibility of placing a sentence boundary after the word $w_i$, that is, converting the stream of words into $\cdots, w_{i-1}, w_i, \langle/s\rangle, \langle s\rangle, w_{i+1}, \cdots$, where $\langle/s\rangle$ and $\langle s\rangle$ are the sentence end and start markers, respectively. The confidence score is based on the ratio of two probabilities arising from two hypotheses defined below:

Hypothesis I: there is no sentence boundary after word $w_i$. The corresponding Markov chain is

\[
P_i^{\langle I \rangle} = P_{\text{left}} \cdot P(w_{i+1}^{i+n-1}) \cdot P_{\text{right}}
= P_{\text{left}} \cdot \prod_{k=i+1}^{i+n-1} p(w_k \mid w_{k-n+1}^{k-1}) \cdot P_{\text{right}} \tag{1}
\]

where $p$ denotes the probability from the language model, and $P_{\text{left}}$ and $P_{\text{right}}$ are the probabilities of the left and right contexts of the words $w_{i+1}^{i+n-1}$.
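As a concrete instance, and assuming a trigram model ($n = 3$, chosen here purely for illustration), the product in Equation (1) runs over $k = i+1$ and $k = i+2$, which are exactly the words whose $n$-gram history reaches back across the candidate boundary position:

\[
P_i^{\langle I \rangle} = P_{\text{left}} \cdot p(w_{i+1} \mid w_{i-1}, w_i) \cdot p(w_{i+2} \mid w_i, w_{i+1}) \cdot P_{\text{right}}
\]

All other $n$-grams lie entirely to the left or to the right of the boundary position and are absorbed into $P_{\text{left}}$ and $P_{\text{right}}$.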

Hypothesis II: there is a sentence boundary after the word $w_i$. The corresponding Markov chain is

\[
P_i^{\langle II \rangle} = P_{\text{left}} \cdot P(\langle/s\rangle, \langle s\rangle, w_{i+1}^{i+n-1}) \cdot P_{\text{right}}
= P_{\text{left}} \cdot p(\langle/s\rangle \mid w_{i-n+2}^{i}) \cdot p(w_{i+1} \mid \langle s\rangle) \cdot \prod_{k=i+2}^{i+n-1} p(w_k \mid w_{i+1}^{k-1}, \langle s\rangle) \cdot P_{\text{right}} \tag{2}
\]

The confidence score is defined as the ratio of the probabilities $P_i^{\langle II \rangle}$ and $P_i^{\langle I \rangle}$, that is,

\[
s_i = \frac{P_i^{\langle II \rangle}}{P_i^{\langle I \rangle}}
= p(\langle/s\rangle \mid w_{i-n+2}^{i}) \cdot \frac{p(w_{i+1} \mid \langle s\rangle)}{p(w_{i+1} \mid w_{i-n+2}^{i})} \cdot \prod_{k=i+2}^{i+n-1} \frac{p(w_k \mid w_{i+1}^{k-1}, \langle s\rangle)}{p(w_k \mid w_{k-n+1}^{k-1})} \tag{3}
\]
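The ratio in Equation (3) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: `log_boundary_confidence` and `toy_logp` are hypothetical names, the language model is abstracted as a callable returning $\log p(\text{word} \mid \text{context})$, the probabilities in `TOY` are made-up numbers that do not form a normalized model, and the score is accumulated in the log domain (a design choice of this sketch, which avoids underflow and leaves the ranking of boundaries unchanged). Near the edges of the buffer the sketch simply uses whatever history or future words are available.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

BOS, EOS = "<s>", "</s>"
LogProb = Callable[[str, Tuple[str, ...]], float]


def log_boundary_confidence(
    words: Sequence[str], i: int, n: int, logp: LogProb
) -> float:
    """Return log s_i, the log of the ratio in Eq. (3), for a boundary after words[i]."""
    # History w_{i-n+2} .. w_i: the last n-1 words up to and including w_i.
    left = tuple(words[max(0, i - n + 2): i + 1])
    score = logp(EOS, left)  # p(</s> | w_{i-n+2}^{i})
    for k in range(i + 1, i + n):  # k = i+1 .. i+n-1
        if k >= len(words):
            break  # assumption: stop early when future words are not yet available
        # Hypothesis II: the history restarts at <s>, followed by w_{i+1} .. w_{k-1}.
        ctx_boundary = (BOS,) + tuple(words[i + 1: k])
        # Hypothesis I: ordinary n-gram history w_{k-n+1} .. w_{k-1}.
        ctx_plain = tuple(words[max(0, k - n + 1): k])
        score += logp(words[k], ctx_boundary) - logp(words[k], ctx_plain)
    return score


# Made-up bigram probabilities, for illustration only.
TOY: Dict[Tuple[str, Tuple[str, ...]], float] = {
    (EOS, ("you",)): 0.5,
    ("how", (BOS,)): 0.4,
    ("how", ("you",)): 0.01,
    ("you", ("thank",)): 0.6,
}


def toy_logp(word: str, context: Tuple[str, ...]) -> float:
    """Hypothetical language model: table lookup with a flat fallback of 0.05."""
    return math.log(TOY.get((word, context), 0.05))


words = ["thank", "you", "how", "are"]
score_after_thank = log_boundary_confidence(words, 0, 2, toy_logp)
score_after_you = log_boundary_confidence(words, 1, 2, toy_logp)
print(score_after_thank, score_after_you)
```

With $n = 2$ the product over $k \geq i+2$ in Equation (3) is empty, so the score after "you" reduces to $p(\langle/s\rangle \mid \text{you}) \cdot p(\text{how} \mid \langle s\rangle) / p(\text{how} \mid \text{you}) = 0.5 \cdot 0.4 / 0.01 = 20$ under the toy numbers, while the score after "thank" is well below 1.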

