Min Hash - SIPL

People Metering Using Mobile Devices

Yehoraz Kasher Annual EE Projects Contest 

Students: Oded Yeruhami, Yuval Bahat

Supervisor: Rafi Steinberg

May 23rd, 2010

Outline 

• People metering 

• People metering using mobile devices 

• Algorithm description 

• Our innovations 

• Conclusion 

2/35

People Metering

3/35

The People Meter 

Rating measurement method today 

Drawbacks: 

– Designated hardware 

– Very small control group 

– Monitors specific TV set 

– Confidentiality of control group 

– Impersonal 

4/35

People Metering Using Mobile Devices

[Block diagram: Query → Fingerprint Creation and Reference → Fingerprint Creation, both feeding Matching → Matched Channel / No Match]

5/35

People Metering Using Mobile Devices 

• As suggested by MobileRL 

– No special hardware required 

– Personal 

– Carried everywhere 

– Can also be used to monitor radio, video, music etc.

6/35

System Requirements 

• High accuracy 

• Robustness to noisy environment 

• Real time results 

• Cellphone constraints:

– Privacy

– Computational effort

– Power consumption

– Sent data size

7/35

Literature Survey 

• ASF – Audio Spectrum Flatness (Hellmuth et al., 2001) 

• An algorithm by (Haitsma & Kalker, 2003) 

• Spectral Similarity (Yang, 2001) 

• Normalized Spectral Subbands Centroids (Seo et al., 2005) 

• Waveprint – Content Fingerprinting Using Wavelets (Baluja & Covell, 2006)

8/35

System Layout 

Feature Extraction 

Extracting significant data 

Fingerprints 

Matching 

Matched channel / No match

9/35

Fingerprint Creation

[Pipeline: Spectrogram creation → 2-D Haar wavelet → Taking strongest coefficients → Min-Hash]

Fingerprint: sub-fingerprints, each one a Min-Hash vector

10/35

Spectrogram

• Represents audio visually

• Divide into overlapping sub-spectrograms (offset by the stride)

• Fine-grained temporal resolution

[Figure: spectrogram, frequency vs. time, with the stride between sub-spectrograms marked]

11/35
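The division into overlapping sub-spectrograms can be sketched as follows. This is a minimal illustration, not the project's code: the function name `sub_spectrograms`, the representation of the spectrogram as a list of time frames, and the `width` and `stride` parameters (both counted in frames) are assumptions made for this example.

```python
def sub_spectrograms(spectrogram, width, stride):
    """Split a spectrogram into overlapping windows along the time axis.

    spectrogram: list of time frames (each frame would hold the frequency bins).
    width:       number of frames per sub-spectrogram (hypothetical parameter).
    stride:      number of frames between the starts of consecutive windows.
    Windows overlap whenever stride < width.
    """
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    last_start = len(spectrogram) - width
    return [spectrogram[start:start + width]
            for start in range(0, last_start + 1, stride)]


# Ten dummy frames, windows of 4 frames, a new window every 2 frames.
frames = list(range(10))
windows = sub_spectrograms(frames, width=4, stride=2)
print(len(windows), windows[1])
```

A smaller stride yields more sub-spectrograms per second of audio, which is what gives the finer temporal resolution at the cost of a larger fingerprint.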

Wavelet Transform

• Good for pointing out local data in images

• Keeping only the strongest coefficients

[Figure: Wavelet transform → keeping strongest coefficients]

Maintains “interesting” data in noisy images

Result – sparse binary vector

12/35
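A minimal sketch of this step, under stated assumptions: the Haar transform below uses plain averages and differences and needs side lengths that are powers of two, and the two-bit sign coding (one bit for "positive", one for "negative") is only one common way to turn the kept coefficients into a sparse binary vector. The slides do not specify either detail, and all function names are hypothetical.

```python
def haar_1d(values):
    """Full 1-D Haar decomposition using averages and differences."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_2d(image):
    """Standard 2-D decomposition: transform every row, then every column."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def top_t_sign_bits(coeffs, t):
    """Keep the t largest-magnitude coefficients and encode only their sign.

    Each coefficient gets two bits: (1, 0) if kept and positive,
    (0, 1) if kept and negative, (0, 0) otherwise.
    """
    flat = [c for row in coeffs for c in row]
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    bits = [0] * (2 * len(flat))
    for i in order[:t]:
        if flat[i] > 0:
            bits[2 * i] = 1
        elif flat[i] < 0:
            bits[2 * i + 1] = 1
    return bits


coeffs = haar_2d([[0, 2], [0, 2]])
print(coeffs)
print(top_t_sign_bits(coeffs, t=2))
```

Because only the signs of the few strongest coefficients survive, most bits are zero, which is why the result is described as a sparse binary vector.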


How can we compare 2 sparse vectors efficiently?

Min-Hash (Cohen et al., 2001)

Sparse vector compact representation:

• p fixed permutations of the vector

• Keeping the index of the first “1” in each permuted vector

[Figure: a sparse binary vector under each permutation, with the first “1” marked]

• Result – a vector with p elements (here, p = 4):

1 3 6 2

Sub-fingerprint

• Compactly representing the sparse vector

• Similar sparse vectors yield similar Min-hash vectors

13/35
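The Min-Hash step can be sketched as below. This is an illustration only: the function names, the seeded random generator used to draw the fixed permutations, the 1-based position convention and the value 0 for an all-zero vector are assumptions of this example, not details given on the slide.

```python
import random


def make_permutations(length, p, seed=0):
    """Draw p fixed permutations of the indices 0..length-1.

    The same seed must be used for queries and references so that both
    sides apply identical permutations.
    """
    rng = random.Random(seed)
    perms = []
    for _ in range(p):
        perm = list(range(length))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def min_hash(bits, perms):
    """Return, per permutation, the 1-based position of the first 1.

    perm[k] is the source index that lands at position k + 1 of the
    permuted vector. An all-zero vector is mapped to 0 here.
    """
    signature = []
    for perm in perms:
        for position, source_index in enumerate(perm, start=1):
            if bits[source_index]:
                signature.append(position)
                break
        else:
            signature.append(0)
    return signature


# Three hand-written permutations of a 4-bit vector.
perms = [[0, 1, 2, 3], [3, 2, 1, 0], [2, 0, 1, 3]]
print(min_hash([0, 1, 0, 1], perms))  # [2, 1, 3]
```

The signature has p elements regardless of the length of the sparse vector, which is what makes it a compact sub-fingerprint.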

Feature Extraction – Summary

[Pipeline: Spectrogram creation → Haar wavelet transform → Keeping strongest coefficients → Sparse binary vector → Min-Hash vector (p elements)]

Fingerprint: Sub-Fingerprint #1, Sub-Fingerprint #2, Sub-Fingerprint #3

14/35

Fingerprint Matching

[Block diagram: Query and References sub-fingerprints → Candidate Sub-Fingerprint Selection (LSH) → Candidates → Fingerprint Temporal Alignment → Best Match]

15/35

Locality Sensitive Hashing (Gionis et al., 1999)

[Block diagram: Query Min-hash and References Min-hash → LSH → Match Candidates]

• Reduces matching problem dimensions

• Low computational complexity

• Efficiently narrows down matching candidates

16/35
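One common way to apply LSH to Min-hash vectors is banding: each vector is cut into short bands, each band is used as a hash-table key, and only references that share enough bands with the query are kept as candidates. The sketch below assumes this scheme; the band size, the vote threshold `min_votes` and all function names are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def bands(signature, band_size):
    """Cut a Min-hash vector into (band number, band contents) keys.

    A trailing partial band, if any, is dropped.
    """
    return [(start // band_size, tuple(signature[start:start + band_size]))
            for start in range(0, len(signature) - band_size + 1, band_size)]


def build_index(references, band_size):
    """Map every band key to the set of reference ids that contain it."""
    index = defaultdict(set)
    for ref_id, signature in references.items():
        for key in bands(signature, band_size):
            index[key].add(ref_id)
    return index


def candidates(index, query_signature, band_size, min_votes=1):
    """Return reference ids sharing at least min_votes bands with the query."""
    votes = defaultdict(int)
    for key in bands(query_signature, band_size):
        for ref_id in index.get(key, ()):
            votes[ref_id] += 1
    return {ref_id for ref_id, count in votes.items() if count >= min_votes}


references = {"ch1": [1, 3, 6, 2], "ch2": [5, 5, 1, 9]}
index = build_index(references, band_size=2)
print(candidates(index, [1, 3, 7, 7], band_size=2))  # {'ch1'}
```

Only the candidates returned here need a full distance computation, so the cost per query depends on the number of hash-table hits rather than on the size of the reference set.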

Fingerprints Temporal Alignment

[Block diagram: LSH → Alignment (query stride vs. reference stride) → Distance Calculation → Distances → Grade Calculation]

17/35

Verifying Our Implementation 

• Forced matching scenario 

• Tested with various digitally added noises: white Gaussian noise, echo…

Our system produced good results 

However our problem is more difficult… 

• No forced matching scenario 

• Recordings in a noisy environment 

18/35

References 

• 120 samples 

• 1 minute long 

Queries 

• 430 samples 

• 8 seconds long 

Datasets by 

Two query types:

• “Good” recordings: 69% Match

• “Bad” recordings: 19% Match

19/35


Matching Criterion

Distance threshold

– Requires threshold adjusting (varies with recording quality)

Metrics

Precision & Recall (per threshold)

$\text{Precision} = \frac{\text{True identifications}}{\text{All identified}}$

$\text{Recall} = \frac{\text{True identifications}}{\text{All queries}}$

20/35
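The two metrics can be computed per threshold as in the sketch below. The function name and the counts in the usage line are made up for illustration and are not results from the project.

```python
def precision_recall(true_identifications, all_identified, all_queries):
    """Precision = true / all identified, recall = true / all queries.

    A ratio with a zero denominator is reported as 0.0 here, which is a
    convention chosen for this sketch.
    """
    precision = true_identifications / all_identified if all_identified else 0.0
    recall = true_identifications / all_queries if all_queries else 0.0
    return precision, recall


# Hypothetical counts: 100 queries, 50 declared matches, 45 of them correct.
print(precision_recall(45, 50, 100))
```

Raising the distance threshold typically declares more matches, which tends to raise recall and lower precision, hence the need to report both per threshold.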

Original Algorithm Results

Bad recordings: Recall = 13%, Precision = 96%

Good recordings: Recall = 65%, Precision = 78%

21/35

Problem: 

Bad recordings - very low success rate 

Let's have a closer look… 

22/35

Success Rates Problem

• Main problem appears in “bad recordings”

[Figure: spectrograms (frequency vs. time) of a reference and of a “bad recording” query]

23/35

Proposed Solution

Biasing the wavelet picking

[Figure: histogram of the strongest wavelets picked, by type (DC, Freq., Time, Time/Freq.), over the frequency and time dimensions]

24/35

After Weighted Wavelet Picking

Good Recordings

After: Recall = 90%, Precision = 97%

Before: Recall = 65%, Precision = 78%

25/35

After Weighted Wavelet Picking

Bad Recordings

After: Recall = 49%, Precision = 99%

Before: Recall = 13%, Precision = 96%

(Th. = 0) Recall = Precision = 58%

26/35


Matching Criterion

Different criterion: Recurrence check

– Sending up to N queries sequentially

– Looking for M recurring matches

Advantages

– Increases success rates

– No threshold needed

– Independent of environment

– Overcomes sporadic noise

27/35
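A minimal sketch of the recurrence check, under two assumptions that the slide does not spell out: "recurring" is taken to mean M matches to the same channel anywhere among the queries sent so far (not necessarily consecutive), and sending stops as soon as that happens. The function name and the input format (one best-match channel id, or `None`, per query) are hypothetical.

```python
from collections import Counter


def recurrence_check(matches, n=10, m=4):
    """Decide on a channel from a sequence of per-query best matches.

    matches: iterable of channel ids, or None when a query matched nothing.
    n:       maximum number of queries to send.
    m:       number of matches to the same channel required for a decision.
    Returns (channel or None, number of queries actually sent).
    """
    counts = Counter()
    sent = 0
    for channel in matches:
        if sent == n:
            break
        sent += 1
        if channel is None:
            continue
        counts[channel] += 1
        if counts[channel] == m:
            return channel, sent
    return None, sent


# Channel "A" recurs four times by the seventh query despite sporadic errors.
print(recurrence_check(["A", None, "B", "A", "A", "C", "A", "A"]))  # ('A', 7)
```

Since a sporadic wrong match rarely repeats M times, no distance threshold has to be tuned, and stopping early keeps the number of queries sent below N in the favourable case.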

Recurrence Check 

Recall = 58%, N = 10, M = 4, # Channels = 30

P_true = 93%, P_false = 0.9%

For bad recordings! 

But… 

Increases size of sent data 

28/35

Reducing Signature Size – 1st Solution

$\text{Size [Byte]} = \#\text{sub-fingerprints} \times \text{min-hash vector length}$

(13.24KB for 8 sec. query)

• #sub-fingerprints depends on stride size (query stride, reference stride)

Solution:

- Switch between strides

- Change number of permutations

29/35

Reducing Signature Size – 1st Solution

Reference sub-fingerprints ×20, Query sub-fingerprints ×0.2

Min-hash permutations ×0.5

Recall = 55%, Query size = 1.33KB

N = 10, M = 4

P_true = 93%, P_false = 0.9%, E[sent size] = ~9KB

For bad recordings!

30/35

Reducing Signature Size – 2nd Solution

• Golomb-Rice coding (Golomb, 1966)

• Near-entropy coding for an infinite geometrically distributed input

• Utilizing the Min-hash distribution, which is close to geometric

[Figure: cumulative distribution function of the Min-hash values]

~20% Compression

31/35

Conclusion 

Implemented a people metering 

system using mobile devices 

– Personal 

– Carried everywhere 

– Not only TV 

32/35

Conclusion 

Based on Waveprint algorithm by 

Innovation #1

Biasing the wavelet picking

- Higher match rates

Innovation #2

Recurrence check

- Higher match rates

- Environment independent

33/35

Conclusion 

Innovation #3

Reducing sent fingerprint size

Innovation #4

Compressing sent data

- Smaller sent data size

34/35

Conclusion 

• System is suitable for commercial use 

For example: 

P true =93%, P false =0.9%, E[sent size] = ~9KB 

• Supplied to MobileRL 

• A paper is in preparation

35/35

Acknowledgments 

• Rafi Steinberg 

• SIPL staff 

– Yair Moshe 

– Nimrod Peleg 

• MobileRL 

– Aron Weiss, CTO

37 

Backup

References

[1] O. Hellmuth et al., Advanced audio identification using MPEG-7 content description. Fraunhofer Institute for Integrated Circuits, Convention Paper 5463, 111th Convention, USA, September 2001.

[2] J. Haitsma, T. Kalker, A highly robust fingerprinting system. Philips Research, Journal of New Music Research, Volume 32, Number 2, pp. 211-221, June 2003.

[3] C. Yang, Music database retrieval based on spectral similarity. Stanford University Database Group Technical Report 2001-14, IEEE Workshop on Applications of Signal Processing, 2001.

[4] J. S. Seo et al., Audio fingerprinting based on normalized spectral subband centroids. IEEE International Conference, pp. 213-216, Vol. 3, March 2005.

38

References – Cont.

[5] S. Baluja, M. Covell, Content fingerprinting using wavelets. CVMP, pp. 198-207, 2006.

[6] E. Cohen et al., Finding interesting associations without support pruning. Knowledge and Data Engineering, 13(1), 2001.

[7] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing. Proc. International Conference on Very Large Data Bases, 1999.

[8] S. W. Golomb, Run-length encodings. IEEE Trans. Info. Theory, 12(3):399, 1966.

[9] A students' project guided by Mr. Yair Moshe – Initial Code.

39

2nd Solution – Golomb Rice (Cont.)

• Uses run-length encoding – code length rises with value

#      coded
1      000001
95     11011111
96     111000000
255    1111111011111

t=200, M=32: compressed to 85%

t=300, M=19: compressed to 78%

(In addition to the query size reduction)

40
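The four code words on this slide are consistent with a Rice code with M = 32: the quotient value // M is written in unary (that many ones followed by a zero) and the remainder in 5 binary bits. The sketch below reproduces them under that reading; the function names are hypothetical, and only power-of-two M is handled, so the M = 19 setting would need the general Golomb code with a truncated binary remainder, which is not shown.

```python
def rice_encode(value, m):
    """Golomb-Rice code word for a non-negative integer, m a power of two.

    Layout: unary quotient (q ones, then a zero) followed by the remainder
    in k = log2(m) binary digits.
    """
    if value < 0 or m < 1 or m & (m - 1):
        raise ValueError("value must be >= 0 and m a power of two")
    k = m.bit_length() - 1
    quotient, remainder = divmod(value, m)
    remainder_bits = format(remainder, "0{}b".format(k)) if k else ""
    return "1" * quotient + "0" + remainder_bits


def rice_decode(code, m):
    """Invert rice_encode for a single code word."""
    k = m.bit_length() - 1
    quotient = code.index("0")
    remainder = int(code[quotient + 1:quotient + 1 + k], 2) if k else 0
    return quotient * m + remainder


for value in (1, 95, 96, 255):
    print(value, rice_encode(value, 32))
```

Small values, which dominate a near-geometric Min-hash distribution, cost 6 bits instead of 8, while rare large values cost more, which is where the net compression comes from.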


How can we compare two sparse vectors efficiently?

Jaccard Coefficient:

$J(M, N) = \frac{|M \cap N|}{|M \cup N|}$

Problem:

• Long sparse binary vectors

• Similarity calculation is complicated

41
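For reference, the Jaccard coefficient of two binary vectors can be computed directly as sketched below, treating each vector as the set of positions that hold a 1. The function names are hypothetical, and returning 1.0 for two all-zero vectors is a convention chosen here. The second function shows the cheaper alternative: the fraction of positions on which two Min-hash vectors agree, which is a standard estimator of the Jaccard coefficient when both vectors were built with the same permutations.

```python
def jaccard(a, b):
    """Jaccard coefficient |M ∩ N| / |M ∪ N| of two binary vectors."""
    set_a = {i for i, bit in enumerate(a) if bit}
    set_b = {i for i, bit in enumerate(b) if bit}
    union = set_a | set_b
    if not union:
        return 1.0  # convention for two empty sets
    return len(set_a & set_b) / len(union)


def signature_agreement(sig_a, sig_b):
    """Fraction of equal elements in two equal-length Min-hash vectors."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


print(jaccard([1, 1, 0, 0, 1], [0, 1, 0, 1, 1]))          # 0.5
print(signature_agreement([1, 3, 6, 2], [1, 3, 5, 2]))    # 0.75
```

The direct computation touches every position of the long sparse vectors, whereas the signature comparison touches only p elements.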
