Proceedings of the 13 ESSLLI Student Session - Multiple Choices ...

More documents

Recommendations

Info

Proceedings of the 13 th ESSLLI Student Session Figure 1: A human user conversing with an artificial DP in our interaction environment (structured as in section 2). A dialogue recorder wiretaps their conversation. We do not (and do not need to) take into account the content of the dialogues and in fact we limit our speech analysis to simple prosodic features for the EoT prediction. Thus, for this work, we abstract away from all questions of content management and let our dialogue participants speak randomly selected pre-recorded utterances – though with proper turn-taking. The remainder of the paper is structured as follows: Section 2 describes the system architecture and Section 3 the corpora we use. Section 4 evaluates the speech state classification and Section 5 demonstrates and evaluates some simple turn-management strategies. We close with conclusions and ideas for further work. 2 Architecture of the Interaction Environment Our architecture defines an interaction environment in which dialogue participants (DPs) communicate with each other. Interaction is purely non-symbolic, using asynchronous audio streams over RTP (Schulzrinne, Casner, Frederick and Jacobson, 2003). There is no common clock, or other synchronisation required between DPs. The architecture provides a headset tool for human DPs, and monitoring tools to listen to ongoing dialogues and to record them to disk. Figure 1 shows two dialogue participants – one human, one artificial – conversing in the environment described above. The artificial DP on the right of figure 1 is structured as described below. Artificial DPs are realized as modular and extensible collections of event-driven software agents in the open agent architecture, OAA (Martin, Cheyer and Moran, 1999). In the OAA each software agent advertises its own abilities to solve problems (such as generating utterances) and may itself request other agents to solve sub-problems (e. g. sending data over RTP). For audio processing inside the DP we rely on the Sphinx-4 framework (Walker, Lamere, Kwok, Raj, Singh, Gouvea, Wolf and Woelfel, 2004) which we extended for our audio-processing pipeline. In the current system, we do not yet use Sphinx’ abilities as a speech recognizer and most other modules that would be needed for a real dialogue system are missing. These are obvious enhancements for later versions. 18
21 Speech Generation Speech generation consists of a synthesizer and a dispatcher. The synthesizer currently selects from a corpus of pre-recorded utterances and will be extended to include text-tospeech. To make turn-taking management harder and the system more realistic a fixed delay of 100 ms between signal to the module and onset of the recorded utterance is introduced at this point. 1 This delay is realized by sending 100 ms of recorded silence before the utterance and utterances are also followed by 100 ms of recorded silence. (If DPs were to send digital zeros directly before and after their utterances, speech state classification, as described below, would become trivial.) The speech dispatcher continuously sends an RTP stream in packets of 10 ms, either audio from a file or sine waves if so instructed by the synthesizer, or silence (digital zero). It can also be ordered to interrupt the audio and to revert to silence. The dispatcher also publishes its current speech state which may be one of sil, start of turn (SoT), talk, or end of turn (EoT) to the DP it is part of. 22 Speech Analysis Proceedings of the 13 th ESSLLI Student Session Speech analysis focuses solely on local prosodic analysis for the classification of the listening state (which should reflect the interlocutor’s speech state, as described above). In order to be effective, classification must happen with as short a lag as possible. While short lags would allow for reactive behaviour, we aim to predict when the interlocutor’s end of turn is approaching in order to achieve smooth turn changes and counter-balance the 100 ms lag before a response can be uttered by the speech generation. We use machine learning to classify each received frame (10 ms) of audio as silence (sil), ongoing talk (talk) or end of turn (EoT). Classification is based exclusively on signal power, pitch and derived features. Our pitch extraction is modelled after the first three steps of the YIN algorithm (de Cheveigné and Kawahara, 2002). As no smoothing or dynamic programming is applied to the pitch extraction, results are computed incrementally in real-time and become available instantaneously. The algorithm runs at several times real-time on average hardware. On the corpora described below, the gross error rate is 1.6 % compared to the well known ESPS algorithm (Talkin, 1995). In order to track changes over time, we derive features by windowing over past values of pitch and power with sizes ranging from 20 to 500 ms. While the features calculated on smaller windows help to smooth and to remove outliers due to failures of the pitch extraction, the larger windows are expected to capture long-term trends. We calculate the arithmetic mean and the range of the values, the mean difference between values within the window and the relative position of the minimum and maximum. We also perform a linear regression and use its slope, the MSE of the regression and the error of the regression for the last value in the window. 23 Turn-Taking Management The turn-taking management agent determines whether to start or stop emitting utterances on the basis of the states of the generation and analysis modules. An important aspect in turn-taking management is robustness. To be robust, the turn-taking strategy must not 1 In a dialogue system NLG and TTS would require processing time; for humans there is a delay between starting to plan an utterance and the start of the articulation (Levinson, 1983). 19
Page 1 and 2: Proceedings of the 13 th ESSLLI Stu
Page 3 and 4: Contents Preface . . . . . . . . .
Page 5 and 6: Preface Proceedings of the 13 th ES
Page 9 and 10: mechanism of the encoded function.
Page 13 and 14: The table below presents experiment
Page 17: SIMULATING SPOKEN DIALOGUE WITH A F
Page 31 and 32: We now state our update semantics f
Page 33 and 34: 3 Comparison With a Partition Seman
Page 35 and 36: We take this objection seriously, a
Page 37 and 38: BARE PREDICATION AND KINDS ∗ Bert
Page 39 and 40: The reason why the b-variants are o
Page 41 and 42: 2006)). The second is that nouns re
Page 43 and 44: 43 Non-bare predication and non-uni
Page 45 and 46: References Proceedings of the 13 th
Page 47 and 48: DIAGRAMMATIC REASONING WITH ENHANCE
Page 51 and 52: previously. Proceedings of the 13 t
Page 53 and 54: type theory of Euler diagrams based
Page 55 and 56: software (Stapleton and Delaney, 20
Page 57 and 58: FICTIONAL CONTINGENCIES Gemma Celes
Page 63 and 64: of as the individuals that these fi
Page 69 and 70:
Proceedings of the 13 th ESSLLI Stu
Page 71 and 72:
Page 73 and 74:
Page 75 and 76:
TOWARDS A NEW CHARACTERISATION OF C
Page 77 and 78:
The following paper continues this
Page 79 and 80:
Page 81 and 82:
only one increasing chain completel
Page 83 and 84:
the minimal step width seems to be
Page 85 and 86:
Page 87 and 88:
etrieved in order to check it again
Page 89 and 90:
Results The percentages of correct
Page 91 and 92:
The complexity of relative clauses
Page 93 and 94:
References Proceedings of the 13 th
Page 95 and 96:
A SALIENCE-DRIVEN APPROACH TO SPEEC
Page 97 and 98:
Page 99 and 100:
2. the linguistic salience or “re
Page 101 and 102:
Page 103 and 104:
where the expectations provided by
Page 105 and 106:
A LOGIC WITH A CONDITIONAL PROBABIL
Page 107 and 108:
• v : F orC × W −→ {0, 1} is
Page 109 and 110:
Page 111 and 112:
Lemma 4 M ∗ is a measurable model
Page 113 and 114:
is a tautology. Now we can equivale
Page 115 and 116:
A PROOF-THEORETIC APPROACH TO FRENC
Page 117 and 118:
Page 119 and 120:
Page 121 and 122:
Page 123 and 124:
References Abeillé, A., Godard, D.
Page 125 and 126:
INFINITE GAMES FROM AN INTUITIONIST
Page 127 and 128:
Page 129 and 130:
Page 131 and 132:
Page 133 and 134:
6 Further problems Proceedings of t
Page 135 and 136:
Page 137 and 138:
pointers to the text, containing th
Page 139 and 140:
example in a comparatively short te
Page 141 and 142:
Such an example is: Constant modifi
Page 143 and 144:
WORD SPACE MODELS OF SEMANTIC SIMIL
Page 145 and 146:
tween semantic similarity and seman
Page 147 and 148:
wu & palmer 0.0 0.2 0.4 0.6 0.8 1.0
Page 149 and 150:
Page 151 and 152:
frequency F−score 0 100 200 300 4
Page 153 and 154:
EXAMINING THE NOTICING FUNCTION OF
Page 155 and 156:
Page 157 and 158:
effects of input flood and individu
Page 159 and 160:
Page 161 and 162:
Page 163 and 164:
Page 165 and 166:
CDIPROVER3: A TOOL FOR PROVING DERI
Page 167 and 168:
Page 169 and 170:
Example 8. Consider the TRS given i
Page 171 and 172:
Page 173 and 174:
Page 175 and 176:
THE RANK(S) OF A TOTALLY LEXICALIST
Page 177 and 178:
Page 179 and 180:
Page 181 and 182:
Page 183 and 184:
Page 185 and 186:
EXPRESSING CONJUNCTIVE AND AGGREGAT
Page 187 and 188:
Page 189 and 190:
(Rule) (Semantic Action) Qwh → In
Page 191 and 192:
(⇐) We will prove, by induction o
Page 193 and 194:
Page 195 and 196:
1 Introduction INTERROGATION IN DYN
Page 197 and 198:
Page 199 and 200:
to hold for questions to be felicit
Page 201 and 202:
5 Further research One goal for a s
Page 203 and 204:
THE SEMANTIC CHANGE OF THE FRENCH -
Page 205 and 206:
Page 207 and 208:
Page 209 and 210:
Page 211 and 212:
6 Conclusion In this paper, I inves
Page 213 and 214:
ADVERSARY IMPLICATURES ∗ Grégoir
Page 215 and 216:
Page 217 and 218:
Page 219 and 220:
2.2.1 The derivation of Q-Implicatu
Page 221 and 222:
of the first, the relation of Contr
Page 223 and 224:
List of authors Martin Avanzini Mar
show all

Proceedings of the 13 ESSLLI Student Session - Multiple Choices ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?