
impact assessment or threat refinement (level 3), and lastly process refinement (level 4). This model, however, does not incorporate the human in the loop, and an additional level called user refinement (level 5) has been proposed to “delineate the human from the machine in the process refinement”, allowing the human to play an important role in the fusion process (Blasch and Plano, 2002).

• Level 0: Sub-object assessment
• Level 1: Object assessment
• Level 2: Situation assessment
• Level 3: Impact assessment
• Level 4: Process refinement
• Level 5: User refinement

Levels 0 and 1 deal with sub-object and object assessment respectively, making use of information from multiple sources to arrive at a representation of the objects of interest in the environment. In level 2, relationships between the identified objects are established. At the end of this level, once the situation assessment process is complete, the system has achieved situation awareness, as both the objects detected in the environment and the various ways in which they are related or connected to each other are known to the system. Level 3 allows the system to predict the effects of actions or situations on the environment. Level 4 attempts to refine the outcomes of levels 1, 2 and 3.
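
A minimal sketch of how these six levels could be arranged as a processing pipeline is given below, assuming placeholder stage implementations; none of the class, function or variable names come from the CORBYS specification or the cited literature, they are invented purely for illustration.

```python
# Illustrative sketch only: a skeletal pipeline mirroring the six JDL levels
# described above, with trivial placeholder stages.
from enum import IntEnum
from typing import Any, Callable, Dict, List


class FusionLevel(IntEnum):
    SUB_OBJECT_ASSESSMENT = 0   # signal/feature-level processing
    OBJECT_ASSESSMENT = 1       # object representations from multiple sources
    SITUATION_ASSESSMENT = 2    # relationships between the identified objects
    IMPACT_ASSESSMENT = 3       # predicted effects of actions or situations
    PROCESS_REFINEMENT = 4      # refine the outcomes of levels 1-3
    USER_REFINEMENT = 5         # human-in-the-loop adjustment (Blasch and Plano, 2002)


def _sub_object(data: List[Any]) -> Dict[str, Any]:       # level 0
    return {"features": data}

def _object(state: Dict[str, Any]) -> Dict[str, Any]:      # level 1
    state["objects"] = [{"id": i, "feature": f} for i, f in enumerate(state["features"])]
    return state

def _situation(state: Dict[str, Any]) -> Dict[str, Any]:   # level 2
    objs = state["objects"]
    state["relations"] = [(a["id"], b["id"]) for a in objs for b in objs if a["id"] < b["id"]]
    return state

def _impact(state: Dict[str, Any]) -> Dict[str, Any]:      # level 3
    state["impacts"] = {rel: "unknown" for rel in state["relations"]}
    return state

def _process_refinement(state: Dict[str, Any]) -> Dict[str, Any]:  # level 4
    return state

def _user_refinement(state: Dict[str, Any]) -> Dict[str, Any]:     # level 5
    return state


PIPELINE: Dict[FusionLevel, Callable] = {
    FusionLevel.SUB_OBJECT_ASSESSMENT: _sub_object,
    FusionLevel.OBJECT_ASSESSMENT: _object,
    FusionLevel.SITUATION_ASSESSMENT: _situation,
    FusionLevel.IMPACT_ASSESSMENT: _impact,
    FusionLevel.PROCESS_REFINEMENT: _process_refinement,
    FusionLevel.USER_REFINEMENT: _user_refinement,
}


def run_fusion_cycle(sensor_data: List[Any]) -> Any:
    """Run one pass through levels 0-5 in order."""
    state: Any = sensor_data
    for level in sorted(PIPELINE):
        state = PIPELINE[level](state)
    return state


result = run_fusion_cycle([1.0, 2.0])  # toy input from two "sensors"
```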

Corradini et al. (2005) list a number of approaches and architectures for multimodal fusion in multimodal systems, such as carrying out multimodal fusion in a maximum likelihood estimation framework; using distributed agent architectures (e.g. the Open Agent Architecture, OAA (Cheyer and Martin, 2001)) with intra-agent communication taking place through a blackboard; and identifying individuals via “physiological and/or behavioural characteristics”, e.g. biometric security systems using fingerprints, iris, face, voice, hand shape, etc. (Corradini et al. 2005). Corradini et al. (2005) state that modality fusion in such systems involves less complicated processing, as it falls largely under a “pattern recognition framework”, and that this process may use techniques for integrating “biometric traits” (Corradini et al. 2005) such as the weighted sum rule (Wang et al. 2003), Fisher discriminant analysis (Wang et al. 2003), decision trees (Ross and Jain, 2003) and decision fusion schemes (Jain et al. 1999).
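
As an illustration of the weighted sum rule mentioned above, the sketch below fuses normalised match scores from two hypothetical biometric matchers. The min-max normalisation, the weights and the acceptance threshold are assumptions made for the example, not values taken from Wang et al. (2003).

```python
# Score-level fusion with the weighted sum rule (illustrative values only).
from typing import Dict
import numpy as np


def min_max_normalise(scores: np.ndarray) -> np.ndarray:
    """Map raw matcher scores to [0, 1] so different modalities can be combined."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)


def weighted_sum_fusion(modality_scores: Dict[str, np.ndarray],
                        weights: Dict[str, float]) -> np.ndarray:
    """Fuse per-modality scores: s = sum_m w_m * s_m, with the weights normalised to 1."""
    total_w = sum(weights.values())
    return sum((weights[m] / total_w) * min_max_normalise(s)
               for m, s in modality_scores.items())


# Example: three candidate identities scored by a face matcher and an iris matcher.
scores = {
    "face": np.array([0.62, 0.91, 0.40]),
    "iris": np.array([0.55, 0.88, 0.30]),
}
fused = weighted_sum_fusion(scores, weights={"face": 0.4, "iris": 0.6})
accepted = fused > 0.5   # illustrative decision threshold
```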

Corradini et al. (2005) also list a number of systems fusing speech and lip movements, such as those using histograms and multivariate Gaussians (Nock et al. 2002), artificial neural networks (Wolff et al. 1994; Meier et al. 2000) and hidden Markov models (Nock et al. 2002).
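
To make the Gaussian-based stream fusion idea concrete, the sketch below scores toy audio and visual feature vectors with per-class multivariate Gaussians and sums their weighted log-likelihoods, which amounts to treating the two streams as independent. The classes, feature dimensions and stream weight are invented for the example and are not taken from Nock et al. (2002).

```python
# Late fusion of audio and visual streams with per-class multivariate Gaussians.
import numpy as np
from scipy.stats import multivariate_normal

# Toy per-class models: (mean, covariance) for audio and visual feature vectors.
classes = {
    "yes": {"audio": (np.zeros(3), np.eye(3)), "visual": (np.zeros(2), np.eye(2))},
    "no":  {"audio": (np.ones(3),  np.eye(3)), "visual": (np.ones(2),  np.eye(2))},
}


def classify(audio_feat: np.ndarray, visual_feat: np.ndarray,
             audio_weight: float = 0.7) -> str:
    """Pick the class with the highest weighted sum of per-stream log-likelihoods."""
    best_label, best_score = "", -np.inf
    for label, streams in classes.items():
        a_mean, a_cov = streams["audio"]
        v_mean, v_cov = streams["visual"]
        score = (audio_weight * multivariate_normal(a_mean, a_cov).logpdf(audio_feat)
                 + (1 - audio_weight) * multivariate_normal(v_mean, v_cov).logpdf(visual_feat))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(classify(np.array([0.1, -0.2, 0.0]), np.array([0.9, 1.1])))
```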

Some systems use independent, individual modality processing modules, such as a speech recognition module, a gesture recognition module, gaze localisation, etc. Each module carries out mono-modal processing and presents its output to the multimodal processing module, which handles the semantic fusion. Such systems are well suited to a framework in which various showcases may be developed for different application domains by applying re-usable off-the-shelf components, each handling a single modality in full.
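
A minimal sketch of this modular arrangement is given below, assuming a simple slot-filling merge: each mono-modal recogniser returns a partial semantic frame, and a central fusion module combines them, keeping the higher-confidence filler when two modules compete for the same slot. The interface and the merge rule are illustrative assumptions, not components of any of the cited systems.

```python
# Mono-modal recognisers feeding a central semantic fusion module (sketch).
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ModalityModule(ABC):
    """A component handling a single modality in full."""

    @abstractmethod
    def interpret(self, raw_input: Any) -> Dict[str, Any]:
        """Return a partial semantic frame, e.g. {'action': 'move', 'confidence': 0.8}."""


class SpeechRecogniser(ModalityModule):
    def interpret(self, raw_input: str) -> Dict[str, Any]:
        # Placeholder: a real module would run speech recognition and parsing here.
        return {"action": raw_input.strip().lower(), "confidence": 0.8}


class GestureRecogniser(ModalityModule):
    def interpret(self, raw_input: Any) -> Dict[str, Any]:
        # Placeholder: a real module would classify the pointing gesture here.
        return {"target": raw_input, "confidence": 0.6}


class SemanticFusion:
    """Merges the partial frames produced by the mono-modal modules."""

    def __init__(self, modules: List[ModalityModule]):
        self.modules = modules

    def fuse(self, inputs: List[Any]) -> Dict[str, Any]:
        frame: Dict[str, Any] = {}
        for module, raw in zip(self.modules, inputs):
            partial = module.interpret(raw)
            conf = partial.pop("confidence", 1.0)
            for slot, value in partial.items():
                # Keep the higher-confidence filler when two modules fill the same slot.
                if slot not in frame or conf > frame[slot][1]:
                    frame[slot] = (value, conf)
        return {slot: value for slot, (value, _) in frame.items()}


fusion = SemanticFusion([SpeechRecogniser(), GestureRecogniser()])
command = fusion.fuse(["Move there", "pointed_location"])  # fused semantic frame
```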

Other systems include QuickSet (Landragin, 2007), which offers the user the freedom to interact with a map-based application using a pen-and-speech cross-modal input capability. The system presented in Elting (2002) enables the user to specify a command by way of speech, a pointing gesture and input from a graphical user interface, combined in a “pipelined architecture”. The system put forward by Wahlster et al. (2001) is a multimodal

