Gesture-Based Interaction with Time-of-Flight Cameras
accuracy of the tracking is around 5–6 cm root mean square (RMS) for the head and
shoulders and around 2 cm RMS for the head. The implementation of the procedure
is straightforward and real-time capable.

Features
The discussion of TOF image features in Chapter 7 is divided into four individual
parts. The first part, in Section 7.1, will discuss the so-called generalized eccentricities,
a type of feature that can be used to distinguish between different surface types, e.g.
between planar surface regions, edges, and corners in 3D. These features were
employed for detecting the nose in frontal face images, and we obtained an equal
error rate of 3.0%. Section 7.2 will focus on a reformulation of the generalized
eccentricities such that the resulting features become invariant to scale. This is
achieved by computing the features not on the image grid but on the sampled surface
of the object in 3D, which becomes possible by using the range map to invert the
perspective camera projection of the TOF camera. One consequently obtains
irregularly sampled data, for which we propose to compute the features with the
Nonequispaced Fast Fourier Transform. As a result, the nose detection becomes
significantly more robust when the person moves towards and away from the camera;
an error rate of zero is achieved on the test data.
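The inversion of the camera projection mentioned above can be sketched as follows: each pixel of the range map is turned into a 3D point by scaling its viewing ray by the measured radial distance. The intrinsic parameters (`fx`, `fy`, `cx`, `cy`) below are hypothetical calibration values, not values from the thesis.

```python
import numpy as np

def backproject_range_map(r, fx, fy, cx, cy):
    """Invert the pinhole projection of a TOF camera.

    r        : (H, W) array of radial distances from the optical center
    fx, fy   : focal lengths in pixels (hypothetical calibration values)
    cx, cy   : principal point in pixels

    Returns an (H, W, 3) array of 3D points. Note that the resulting
    surface samples are irregularly spaced in 3D even though the
    pixel grid is regular.
    """
    h, w = r.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unit viewing ray through each pixel.
    d = np.stack(((u - cx) / fx, (v - cy) / fy, np.ones_like(r)), axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # A TOF camera measures radial distance, so scale the unit ray by r.
    return d * r[..., None]

# Example: all pixels of a 4x4 toy sensor report a radial distance of 2 m.
pts = backproject_range_map(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```

The irregular sampling of the returned points is what motivates the use of the Nonequispaced Fast Fourier Transform in Section 7.2.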
The third category of image features is computed using the sparse coding principle,
i.e. we learn an image basis for the simultaneous representation of TOF range and
intensity data. We show in Section 7.3 that the resulting features outperform features
obtained using Principal Component Analysis in the same nose detection task that
was evaluated for the geometric features. Compared to the generalized eccentricities
we achieve a slightly reduced performance; on the other hand, in this scenario the
features were obtained purely under the sparse coding principle, without
incorporating prior knowledge of the data or properties of the object to be detected.
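The sparse coding idea can be illustrated with a minimal sketch. This is not the learning procedure from Section 7.3; it merely shows the encoding step — iterative soft thresholding (ISTA) — for a single vector in which a range patch and an intensity patch would be stacked together so that both modalities share one sparse code. The dictionary here is random, purely for illustration.

```python
import numpy as np

def sparse_code_ista(x, D, lam=0.1, n_iter=200):
    """Sparse-code x w.r.t. dictionary D via iterative soft thresholding.

    x : (m,) signal, e.g. a range patch and an intensity patch stacked
        into one vector so that both modalities share a single code.
    D : (m, k) dictionary with unit-norm columns.
    Approximately minimizes 0.5*||x - D a||^2 + lam*||a||_1.
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)              # gradient of the quadratic term
        a = a - g / L                      # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((32, 64))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x = 1.5 * D[:, 3] - 0.8 * D[:, 10]         # a signal built from two atoms
a = sparse_code_ista(x, D)                 # a sparse code concentrated on few atoms
```

In the thesis the dictionary itself is learned from TOF data; the sketch above only demonstrates why such a code is sparse: the L1 penalty drives most coefficients exactly to zero.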
The fourth type of features, presented in Section 7.4, aims at extracting the 3D
motion of objects in the scene. To this end, we rely on the computation of range
flow; the goal is the recognition of human gestures. We propose to combine the
computation of range flow with the previously discussed estimation of human pose,
i.e. we explicitly compute the 3D motion vectors for the hands of the person
performing a gesture. These motion vectors are accumulated in 3D motion
histograms, and a learned decision rule then assigns a gesture to each frame of a
video sequence. Here, we specifically focus on the problem of detecting that no
gesture was performed, i.e. each frame is either assigned to one of the predefined
gestures or to a rejection class indicating that no gesture was performed.
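The histogram-and-decision-rule pipeline can be sketched as follows. This is a simplified stand-in, not the thesis's classifier: motion vectors for the hands are binned by direction and weighted by magnitude, and a nearest-centroid rule with a distance threshold stands in for the learned decision rule, with the threshold providing the rejection ("no gesture") class. All bin counts and thresholds are illustrative.

```python
import numpy as np

def motion_histogram(vectors, n_bins=4):
    """Accumulate 3D hand-motion vectors into a direction histogram.

    vectors : (N, 3) array of per-frame 3D motion vectors (e.g. range
              flow evaluated at the tracked hand positions).
    Each component of the unit direction is quantized into n_bins bins,
    weighted by motion magnitude; the histogram is L1-normalized.
    """
    mag = np.linalg.norm(vectors, axis=1)
    keep = mag > 1e-9                      # ignore near-zero motion
    dirs = vectors[keep] / mag[keep, None]
    hist, _ = np.histogramdd(dirs, bins=n_bins,
                             range=[(-1, 1)] * 3, weights=mag[keep])
    hist = hist.ravel()
    s = hist.sum()
    return hist / s if s > 0 else hist

def classify(hist, centroids, reject_thresh=0.5):
    """Nearest-centroid rule with a rejection class for 'no gesture'.

    centroids : dict mapping gesture name -> reference histogram.
    Returns the best-matching gesture, or None when even the best
    match is farther than reject_thresh (i.e. no gesture detected).
    """
    name, dist = min(((g, np.linalg.norm(hist - c))
                      for g, c in centroids.items()), key=lambda t: t[1])
    return name if dist <= reject_thresh else None

# Toy example: two reference gestures and one motionless frame.
up = motion_histogram(np.tile([0.0, 1.0, 0.0], (10, 1)))
right = motion_histogram(np.tile([1.0, 0.0, 0.0], (10, 1)))
centroids = {"up": up, "right": right}
```

A frame with no hand motion produces an all-zero histogram, which lies far from every gesture centroid and is therefore rejected — the behaviour the thesis targets with its explicit no-gesture detection.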