
Finding Musically Meaningful Words by Sparse CCA

vocab. size       488   249   203   149   103    50
# CAL500 words    173   118   101    85    65    39
# Web2131 words   315   131   102    64    38    11
% Web2131         .64   .52   .50   .42   .36   .22

Table 2: The fraction of noisy web-mined words in a vocabulary as the vocabulary size is reduced. As the size shrinks, sparse CCA prunes noisy words, and the web-mined words are eliminated ahead of the higher-quality CAL500 words.

4.2 Quantitative Results

4.2.1 Vocabulary Pruning using Sparse CCA

Sparse CCA can be used to perform vocabulary selection, where the goal is to prune noisy words from a large vocabulary. To test this hypothesis, we combined the vocabularies from the CAL500 and Web2131 data sets and considered the subset of 363 songs found in both data sets.

Based on our own informal user study, we found that the Web2131 annotations are noisy compared to the CAL500 annotations. We showed subjects 10 words from each data set and asked them which set of words was relevant to a song. The Web2131 annotations fared little better than words selected at random from the vocabulary, whereas the CAL500 words were mostly judged relevant.

Because Web2131 was found to be noisier than CAL500, we expect sparse CCA to filter out more of the Web2131 vocabulary. Table 2 shows the results of this experiment. In the experiment we select a value for the sparsity parameter ρ and obtain a sparse vocabulary; we then record how many noisy Web2131 words remain in the resulting vocabulary. The first column of Table 2 reflects the vocabulary with no sparsity constraint. Because the starting vocabularies are of different sizes, Web2131 initially accounts for .64 of the combined vocabulary.
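The pruning experiment can be sketched with a toy example. The snippet below is not the paper's algorithm: it computes a single sparse canonical component by alternating soft-thresholded power iteration on the cross-covariance matrix (in the spirit of penalized matrix decomposition), and the "clean"/"noisy" word split, dimensions, and threshold value are all invented for illustration. Words whose weight is driven to zero by the sparsity penalty are the ones pruned from the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_songs, n_audio = 200, 5

# Hypothetical toy data: 10 "clean" words track audio feature 0, while 20
# "noisy" web-mined words are independent of the audio representation.
X = rng.normal(size=(n_songs, n_audio))                   # audio features
clean = X[:, [0]] + 0.1 * rng.normal(size=(n_songs, 10))  # correlated words
noisy = rng.normal(size=(n_songs, 20))                    # uncorrelated words
Y = np.hstack([clean, noisy])
X = (X - X.mean(0)) / X.std(0)  # standardize columns
Y = (Y - Y.mean(0)) / Y.std(0)

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_cca_word_weights(X, Y, rho, n_iter=50):
    """One sparse canonical component via alternating soft-thresholded
    power iteration on the cross-covariance (a PMD-style sketch, not the
    paper's exact method). Returns the word-side weight vector v."""
    C = X.T @ Y
    u = np.full(X.shape[1], 1.0 / np.sqrt(X.shape[1]))
    v = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        v = soft_threshold(C.T @ u, rho)     # sparsity on the word weights
        norm = np.linalg.norm(v)
        if norm == 0:
            return v                         # rho too large: all words pruned
        v /= norm
        u = C @ v
        u /= np.linalg.norm(u)
    return v

v = sparse_cca_word_weights(X, Y, rho=60.0)
kept = np.flatnonzero(v)  # indices of words surviving the sparsity constraint
```

With a larger `rho` the surviving vocabulary shrinks further, mimicking the sweep over ρ reported in Table 2; the uncorrelated words are zeroed out first.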
Subsequent columns in the table show the resulting vocabulary sizes as we increase ρ and, consequently, reduce the vocabulary size. Noisy Web2131 words are discarded by our vocabulary selection process at a faster rate than the cleaner CAL500 vocabulary, upholding our prediction that vocabulary selection by acoustic correlation should tend to remove noisy words. The results imply that Web2131 contains more words that are uncorrelated with the audio representation. The different ways in which these data sets were collected reinforce this finding. Web2131 was mined from a collection of music reviews; the words used in a music review are not explicitly chosen by the writer because they describe a song. This is the exact opposite of the condition under which the CAL500 data set was collected, in which human subjects were specifically asked to label songs in a controlled environment.

4.2.2 Vocabulary Selection for Music Retrieval

In this experiment we apply our vocabulary selection technique to a semantic music annotation and retrieval system. In brief, our system estimates the conditional probabilities of audio feature vectors given words in the vocabulary, P(song|word). These conditional probabilities are modeled as Gaussian mixture models. With these probability distributions in hand, our system can annotate a novel song with words from its vocabulary, or it can retrieve an ordered list of (novel) songs based on a keyword query. A full description of this system can be found in [17].

One useful evaluation metric for this system is the area under the receiver operating characteristic (AROC) curve. (An ROC curve plots the true positive rate as a function of the false positive rate as we move down a ranked list of songs returned for a keyword query.) For each word, AROC ranges from 0.5 for a random ranking to 1.0 for a perfect ranking.
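Per-word AROC can be computed directly from the ranked list by counting correctly ordered (relevant, irrelevant) pairs; the following minimal sketch (the function name and 0/1 labeling convention are our own) makes the metric concrete.

```python
def aroc(relevance):
    """Area under the ROC curve for one ranked list of songs.

    `relevance` is a best-first list of 0/1 labels: relevance[i] == 1 if the
    song at rank i is truly relevant to the query word. AROC equals the
    fraction of (relevant, irrelevant) pairs ranked in the correct order.
    """
    pos = sum(relevance)
    neg = len(relevance) - pos
    if pos == 0 or neg == 0:
        return 0.5  # AROC is undefined for one-class lists; treat as chance
    concordant = 0
    negs_below = neg  # irrelevant songs ranked at or below the current rank
    for r in relevance:
        if r == 1:
            concordant += negs_below  # this relevant song beats all of them
        else:
            negs_below -= 1
    return concordant / (pos * neg)
```

For example, a perfect ranking `[1, 1, 0, 0]` scores 1.0, a fully inverted ranking `[0, 0, 1, 1]` scores 0.0, and an alternating list `[1, 0, 1, 0]` scores 0.75.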
Average AROC is used to describe the performance of the entire system and is found by averaging AROC over all words in the vocabulary.

This final performance metric is brought down by words that are difficult to model: such words have a low AROC and therefore lower the average. We propose to use sparse CCA to discover these difficult words prior to modeling. (Alternatively, one can say that we are using sparse CCA to keep "easy" words that exhibit high correlation, as opposed to discarding difficult words.)
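The retrieval pipeline described above can be illustrated with a toy sketch. For simplicity this fits a single Gaussian per word rather than a full Gaussian mixture, and the word names, feature dimensionality, and cluster separation are invented for illustration; the real system of [17] is considerably richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 1-D "audio features" for songs tagged with
# each of two words. "distorted" songs cluster near +5, "mellow" near -5.
train = {
    "distorted": rng.normal(5.0, 1.0, 200),
    "mellow": rng.normal(-5.0, 1.0, 200),
}

# The system models P(song | word) with a GMM per word; a single Gaussian
# (mean, variance) per word is enough to illustrate the scoring idea.
params = {w: (x.mean(), x.var()) for w, x in train.items()}

def log_lik(word, feats):
    """Average log-likelihood of a song's feature vectors under a word model."""
    mu, var = params[word]
    return (-0.5 * ((feats - mu) ** 2 / var + np.log(2 * np.pi * var))).mean()

def retrieve(query_word, songs):
    """Rank novel songs best-first for a one-word query."""
    scores = [log_lik(query_word, f) for f in songs]
    return sorted(range(len(songs)), key=lambda i: -scores[i])

# Two novel songs: song 0 sounds "mellow", song 1 sounds "distorted".
novel = [rng.normal(-5.0, 1.0, 10), rng.normal(5.0, 1.0, 10)]
order = retrieve("distorted", novel)
```

Evaluating such a ranking against ground-truth labels for each word, and averaging over the vocabulary, yields the average AROC used as the headline metric; pruning words whose models rank songs poorly is what the sparse CCA selection aims to do before any model is fit.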
