December 11-16 2016 Osaka Japan
W16-46
Algorithm 1 Online Sentence Segmenter
Require: w_0, w_1, w_2, ...
W ← []; S ← []
for w_k in stream of words do
    W ← W + [w_k]    ⊲ assume W = [w_0, w_1, ..., w_{k−1}, w_k]
    s_{k−1} ← confidence of segmenting before w_k
    S ← S + [s_{k−1}]    ⊲ assume S = [s_0, s_1, ..., s_{k−1}]
    B ← apply segmentation strategy to S    ⊲ assume B = [b_0, b_1, ..., b_{k−1}]
    if b_i = 1 (0 ≤ i ≤ k − 1) then
        output [w_0, w_1, ..., w_i] as a segment
        remove first i + 1 elements from W and S
    end if
end for
If the segmentation strategy outputs no boundary, no action is taken (represented by (c) in Figure 2). If it
outputs a boundary, a segment is output and the inner sequence is updated accordingly (the process represented
by (d) in Figure 2).
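
The loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `online_segmenter` and `threshold_strategy` are hypothetical, the confidence function is passed in as a callback that only sees the current word buffer, and the threshold rule merely stands in for whichever segmentation strategy is used. The sketch removes the i + 1 buffered words that form the output segment.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def threshold_strategy(scores: List[float], theta: float) -> Optional[int]:
    """Hypothetical strategy: index of the first score above theta, else None."""
    for i, s in enumerate(scores):
        if s > theta:
            return i
    return None


def online_segmenter(
    words: Iterable[str],
    confidence: Callable[[List[str]], float],
    strategy: Callable[[List[float]], Optional[int]],
) -> Iterator[List[str]]:
    """Yield a segment as soon as the strategy reports a boundary."""
    W: List[str] = []    # buffered words
    S: List[float] = []  # S[j] scores a boundary between W[j] and W[j + 1]
    for w in words:
        W.append(w)
        if len(W) > 1:
            # Confidence of segmenting before the newest word (assumed callback).
            S.append(confidence(W))
        i = strategy(S)
        if i is not None:
            yield W[: i + 1]  # output w_0 .. w_i as a segment
            del W[: i + 1]    # drop the emitted words ...
            del S[: i + 1]    # ... and the scores that belonged to them
```

As in Algorithm 1, at most one segment is emitted per incoming word, and words still buffered when the stream ends are not flushed.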
The following two subsections describe the boundary confidence scores and segmentation strategies<br />
in detail, respectively.<br />
3.1 Segment Boundary Confidence Score<br />
This confidence score is based on an N-gram language model. Suppose the language model order is $n$.
The confidence score represents the plausibility of placing a sentence boundary after the word $w_i$,
that is, of converting the stream of words into $\cdots, w_{i-1}, w_i, \langle/s\rangle, \langle s\rangle, w_{i+1}, \cdots$, where $\langle/s\rangle$ and $\langle s\rangle$ are
the sentence end and start markers, respectively. The confidence score is based on the ratio of two probabilities arising
from the two hypotheses defined below:
Hypothesis I: there is no sentence boundary after the word $w_i$. The corresponding Markov chain is

$$
\begin{aligned}
P_i^{\langle \mathrm{I} \rangle} &= P_{\mathrm{left}} \cdot P(w_{i+1}^{i+n-1}) \cdot P_{\mathrm{right}} \\
&= P_{\mathrm{left}} \cdot \prod_{k=i+1}^{i+n-1} p(w_k \mid w_{k-n+1}^{k-1}) \cdot P_{\mathrm{right}}
\end{aligned}
\tag{1}
$$

where $p$ denotes the probability from the language model, and $P_{\mathrm{left}}$ and $P_{\mathrm{right}}$ are the probabilities of the left
and right contexts of the words $w_{i+1}^{i+n-1}$.
Hypothesis II: there is a sentence boundary after the word $w_i$. The corresponding Markov chain is

$$
\begin{aligned}
P_i^{\langle \mathrm{II} \rangle} &= P_{\mathrm{left}} \cdot P(\langle/s\rangle, \langle s\rangle, w_{i+1}^{i+n-1}) \cdot P_{\mathrm{right}} \\
&= P_{\mathrm{left}} \cdot p(\langle/s\rangle \mid w_{i-n+2}^{i}) \cdot p(w_{i+1} \mid \langle s\rangle) \cdot \prod_{k=i+2}^{i+n-1} p(w_k \mid w_{i+1}^{k-1}, \langle s\rangle) \cdot P_{\mathrm{right}}
\end{aligned}
\tag{2}
$$

The confidence score is defined as the ratio of the probabilities $P_i^{\langle \mathrm{II} \rangle}$ and $P_i^{\langle \mathrm{I} \rangle}$, that is,

$$
s_i = \frac{P_i^{\langle \mathrm{II} \rangle}}{P_i^{\langle \mathrm{I} \rangle}}
= p(\langle/s\rangle \mid w_{i-n+2}^{i}) \cdot \frac{p(w_{i+1} \mid \langle s\rangle)}{p(w_{i+1} \mid w_{i-n+2}^{i})} \cdot \prod_{k=i+2}^{i+n-1} \frac{p(w_k \mid w_{i+1}^{k-1}, \langle s\rangle)}{p(w_k \mid w_{k-n+1}^{k-1})}
\tag{3}
$$
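
To make Equation 3 concrete, the following Python sketch computes $s_i$ from a language model exposed as a plain function. It is an illustration under assumptions rather than the authors' code: `boundary_confidence` is a hypothetical name, and `p(word, context)` is an assumed interface that returns the conditional probability of `word` given a tuple of preceding tokens. A practical implementation would more likely work with log probabilities.

```python
from typing import Callable, Sequence, Tuple

# Assumed LM interface (hypothetical): p(word, context) -> probability.
LM = Callable[[str, Tuple[str, ...]], float]


def boundary_confidence(words: Sequence[str], i: int, n: int, p: LM) -> float:
    """Ratio of Eq. (3) for a boundary after words[i], with an order-n model."""
    if n < 2 or i < 0 or i + n - 1 >= len(words):
        raise ValueError("need n >= 2 and n - 1 words of right context")
    # Left context w_{i-n+2} .. w_i, truncated at the start of the sequence.
    left = tuple(words[max(0, i - n + 2): i + 1])
    nxt = words[i + 1]
    score = p("</s>", left) * p(nxt, ("<s>",)) / p(nxt, left)
    for k in range(i + 2, i + n):  # k = i+2 .. i+n-1
        with_boundary = ("<s>",) + tuple(words[i + 1: k])   # context restarts at <s>
        no_boundary = tuple(words[max(0, k - n + 1): k])    # ordinary n-gram context
        score *= p(words[k], with_boundary) / p(words[k], no_boundary)
    return score
```

For a bigram model ($n = 2$) the product in Equation 3 is empty, so the score reduces to $p(\langle/s\rangle \mid w_i) \cdot p(w_{i+1} \mid \langle s\rangle) / p(w_{i+1} \mid w_i)$, which is what the function returns when called with `n=2`.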