21.01.2014 Views

improving music mood classification using lyrics, audio and social tags


5.1.2 Social Tags

Since social tags on last.fm were used for identifying mood categories and the ground truth dataset will be built using a very similar method (see Section 5.2), last.fm is used for collecting social tags applied to the songs in the audio collections. For each song, the 100 most popular tags applied to it are provided by the last.fm API. The social tags used in building the dataset were collected during the month of February 2009, and 12,066 of the audio pieces had at least one last.fm tag.
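
The tag collection step can be illustrated with a minimal sketch. It is not the crawler used for the thesis: the endpoint, the `track.gettoptags` method name and the JSON layout are assumptions based on the public Last.fm web service, and `build_top_tags_url`, `parse_top_tags` and `has_tags` are hypothetical helper names. The sketch only builds the request URL and parses an already downloaded response, so it makes no network call.

```python
from urllib.parse import urlencode

# Assumption: the public Last.fm web-service endpoint and the
# track.getTopTags method; the thesis does not name the call it used.
API_ROOT = "https://ws.audioscrobbler.com/2.0/"
MAX_TAGS = 100  # the text states that the 100 most popular tags are provided


def build_top_tags_url(artist, track, api_key):
    """Build a track.getTopTags request URL (api_key is a placeholder)."""
    params = {
        "method": "track.gettoptags",
        "artist": artist,
        "track": track,
        "api_key": api_key,
        "format": "json",
    }
    return API_ROOT + "?" + urlencode(params)


def parse_top_tags(payload, limit=MAX_TAGS):
    """Return up to `limit` (tag, count) pairs, most popular first.

    Assumed JSON layout: {"toptags": {"tag": [{"name": ..., "count": ...}]}}.
    """
    tags = payload.get("toptags", {}).get("tag", [])
    if isinstance(tags, dict):  # defensive: a lone tag given as a bare object
        tags = [tags]
    pairs = [(t["name"], int(t["count"])) for t in tags]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


def has_tags(payload):
    """True if the song has at least one tag, the criterion used in the text."""
    return len(parse_top_tags(payload)) > 0
```

Songs for which `has_tags` returns `False` would be the ones excluded from the count of pieces with at least one last.fm tag.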

5.1.3 Lyric Data

Knees, Schedl, and Widmer (2005) extracted lyrics from the Internet by querying the Google search engine with keywords in the form “track name” + “artist name” + “lyrics,” but the results showed limited precision despite high recall. For this thesis research, precise lyrics are required, and thus lyrics were gathered from online lyric databases instead of general search engines. Lyricwiki.org was the primary resource because of its broad coverage and standardized format. Mldb.org was the secondary website, which was consulted only when no lyrics were found on the primary database. To ensure data quality, the crawlers were implemented to use song title, artist and album information to identify the correct lyrics. In total, 8,839 songs had both social tags and lyrics. A language identification program (see footnote 10) was then run against the lyrics, and 55 songs were identified and manually confirmed as non-English, leaving lyrics for 8,784 songs.
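
The primary/secondary lookup and the language filter described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis crawler: `find_lyrics` and `keep_english` are hypothetical names, each source is represented by a caller-supplied lookup function rather than real requests to Lyricwiki.org or Mldb.org, and the language detector is passed in as a function instead of calling Lingua::Ident.

```python
def find_lyrics(song, sources):
    """Try each source in priority order; return (source_name, lyrics) or None.

    `song` is a dict with "title", "artist" and "album" keys.
    `sources` is an ordered list of (name, lookup) pairs, where the
    hypothetical lookup(title, artist, album) returns lyrics text or None.
    """
    for name, lookup in sources:
        lyrics = lookup(song["title"], song["artist"], song["album"])
        if lyrics:  # move on to the next source only when nothing was found
            return name, lyrics
    return None


def keep_english(lyrics_by_song, detect_language):
    """Split songs into English and non-English sets.

    `detect_language` is a caller-supplied function returning a language
    code such as "en"; flagged songs would still need manual confirmation.
    """
    english, other = {}, {}
    for song_id, text in lyrics_by_song.items():
        target = english if detect_language(text) == "en" else other
        target[song_id] = text
    return english, other
```

Passing title, artist and album together to every lookup mirrors the data quality check in the text. With the counts reported above, removing the 55 non-English songs from the 8,839 songs with both tags and lyrics gives 8,784.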

The lyrics databases do not provide APIs for downloading. Hence one has to query the databases and download the displayed pages. This makes it a necessary step to clean up

10 Available at http://search.cpan.org/search%3fmodule=Lingua::Ident
