Multiple Sensor Multiple Object Tracking With GMPHD Filter - ISIF
delay of arrival (TDOA) measurement $z_k^q$ is measured from the
$q$-th microphone pair at time $k$. The measurement equation is
$$z_k^q = T^q(x_k) + v_k^q, \qquad q = 1, \dots, Q \tag{28}$$
$$T^q(x_k) = \frac{\|x_k - p_{2,q}\| - \|x_k - p_{1,q}\|}{c} \tag{29}$$
where $p_{i,q}$ is the position of microphone $i$ of pair $q$, $c$ is the
speed of sound, and $v_k^q \sim \mathcal{N}(0, 4 \times 10^{-9})$ is uncorrelated
noise. Because the measurement equation (28) is nonlinear, we
approximate it with the unscented transform in the GMPHD filter [12].
Each speaker has a probability of survival $p_{S,k} = 0.95$ at time $k$,
and the probability of detection is $p_{D,k} = 0.7$. To extract the TDOAs
of multiple speakers, we applied the method from [13]. Figure 4 shows
an example of the TDOA measurements collected at one microphone
pair (here, microphone pair 2).
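As a rough illustration of equations (28) and (29), the sketch below evaluates the noise-free TDOA for one microphone pair and pushes a Gaussian state through it with a basic symmetric sigma-point unscented transform. It is only a sketch under stated assumptions: the speed of sound (343 m/s), the scaling parameter `kappa`, the microphone positions in the usage note, and the function names `tdoa` and `unscented_tdoa` are all illustrative and do not come from the paper, and the unscented transform variant used in [12] may differ.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s (not stated in the text)
R_NOISE = 4e-9   # measurement noise variance of v_k^q, taken from the text


def tdoa(x, p1, p2, c=C_SOUND):
    """Noise-free TDOA T^q(x) of equation (29) for one microphone pair."""
    x, p1, p2 = (np.asarray(v, dtype=float) for v in (x, p1, p2))
    return (np.linalg.norm(x - p2) - np.linalg.norm(x - p1)) / c


def unscented_tdoa(m, P, p1, p2, kappa=1.0):
    """Approximate mean and variance of z = T^q(x) + v for x ~ N(m, P).

    Uses 2n + 1 symmetric sigma points; kappa is an assumed tuning value.
    """
    m = np.asarray(m, dtype=float)
    n = m.size
    # Columns of the Cholesky factor give the sigma-point offsets.
    L = np.linalg.cholesky((n + kappa) * np.asarray(P, dtype=float))
    sigma = np.vstack([m, m + L.T, m - L.T])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    # Propagate each sigma point through the nonlinear TDOA function.
    z = np.array([tdoa(s, p1, p2) for s in sigma])
    z_mean = w @ z
    z_var = w @ (z - z_mean) ** 2 + R_NOISE
    return z_mean, z_var
```

For example, with a hypothetical pair at `p1 = [-0.25, 0.0]` and `p2 = [0.25, 0.0]` (0.5 m spacing), a speaker on the perpendicular bisector of the pair has zero TDOA, and `unscented_tdoa([0.0, 2.0], np.diag([0.01, 0.01]), p1, p2)` returns a predicted mean near zero with a variance larger than the measurement noise alone.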
Fig. 2. Position (x, y) of targets with measurements from sensor 2
Fig. 3. Position (x, y) of targets with fusion method
Fig. 4. GCC TDOA measurements
microphone pairs, each with an inter-sensor spacing of
0.5 m. The speaker sources are all female. The acoustic image
method [16] was used to simulate the room impulse responses.
The reverberation time of the room impulse responses is about
$T_{60} = 0.15$ s. The speech signal-to-noise ratio is about 20 dB.
There are 60 non-overlapping time frames for measuring the TDOA,
each 256 ms long. There are two speakers, who appear and
disappear at different times.
Let $x_k$ be the state of a speaker at time $k$. Here, the state
is the position $(x, y)$ of the speaker. We assume that the motion
model is
$$x_k = A x_{k-1} + w_k \tag{27}$$
where $A = I$ and $w_k \sim \mathcal{N}([0, 0], \mathrm{diag}([0.01, 0.01]))$. This
means that a speaker moves about 10 cm on average between time
$k-1$ and time $k$. Given a speaker $x_k$, the time
Figures 5 and 6 show the multi-speaker tracking performance
of the particle PHD filter [14]. Because the clustering technique
is unreliable, the state estimates are degraded. Figures
7 and 8 show the multi-speaker tracking performance of our
method, which is better than that of the particle PHD filter. For
most of the time when two persons speak simultaneously, our
method gives reliable estimates. This is because the GMPHD
filter does not depend on clustering techniques. The state
estimates are extracted from the means of the Gaussian components
that have high weights.
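The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_states`, the weight threshold of 0.5, and the choice to report a component `round(w)` times are assumptions that follow common GMPHD practice, since the text only says that components with high weights are used.

```python
import numpy as np


def extract_states(weights, means, threshold=0.5):
    """Return the means of Gaussian components whose weight exceeds threshold.

    The threshold is an assumed value. A component whose weight is close
    to 2 is reported twice, because the weight is read here as the
    expected number of speakers that the component represents.
    """
    states = []
    for w, m in zip(weights, means):
        if w > threshold:
            copies = int(round(float(w)))
            states.extend([np.asarray(m, dtype=float)] * copies)
    return states
```

With weights `[0.9, 0.05, 1.8]`, the first mean is reported once, the second is discarded, and the third is reported twice. No clustering of particles is involved, which is the property the text attributes to the GMPHD filter.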
The above result is the performance for one trial. To
measure the average performance, we used the performance
measures from [13]. They include the probability of correct
speaker number, the expected absolute error on the number of
speakers, and the conditional mean distance error given by the
Wasserstein distance. The probability of correct speaker number
is defined by
$$P(|\hat{X}_k| = |X_k|) \tag{30}$$
where $\hat{X}_k$ is the estimate of the multi-speaker state and $X_k$ is
the ground truth. The expected absolute error on the number of
speakers is
$$E\big(\big|\,|\hat{X}_k| - |X_k|\,\big|\big) \tag{31}$$
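One way to estimate the quantities in (30) and (31) from repeated trials is to replace the probability and the expectation with sample averages at each time frame. The sketch below assumes this Monte Carlo reading; the function name `cardinality_metrics` and the array layout (one row per trial, one column per frame) are illustrative choices, not taken from the paper or from [13].

```python
import numpy as np


def cardinality_metrics(est_counts, true_counts):
    """Empirical versions of (30) and (31), computed per time frame.

    est_counts:  array of shape (trials, frames), estimated speaker counts.
    true_counts: array of shape (frames,), ground-truth speaker counts.
    """
    est = np.asarray(est_counts)
    true = np.asarray(true_counts)
    # Fraction of trials in which the estimated count matches the truth.
    p_correct = np.mean(est == true, axis=0)
    # Mean absolute difference between estimated and true counts.
    abs_error = np.mean(np.abs(est - true), axis=0)
    return p_correct, abs_error
```

For instance, if three out of four trials report the correct number of speakers at a frame and the fourth is off by one, the function gives 0.75 for (30) and 0.25 for (31) at that frame.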