
NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...



Improving Categorisation in Social Media using Hyperlinked Object Metadata

Sheila Kinsella

Digital Enterprise Research Institute, National University of Ireland, Galway

sheila.kinsella@deri.org

Abstract

Categorising social media posts is challenging, since they are often short, informal, and rely on external hyperlinks for context. We investigate the potential of external hyperlinks for classifying the topic of posts. We focus on objects with related structured data available. We show that including metadata from hyperlinked objects significantly improves classifier performance. We use the structure of the data to compare the effects of different metadata types on categorisation.

1. Introduction

Hyperlinks are often a vital part of online conversation. Users share videos or photos they have seen and point to products or movies they are interested in. These external resources can provide useful new data such as author information for books, or genre information for movies. Often the post cannot be fully understood without knowing these details. Many of the hyperlinks point to websites that provide metadata for objects, e.g., videos (YouTube) or products (Amazon), and publish this data in a structured format via an API or as Linked Data. Structured data is useful since it allows relevant data types to be identified. Fig. 1 gives an example of a post where useful information can be gleaned from the metadata of a hyperlinked object. Some data, such as the author, is redundant, but the book title and categories are new. The title and categories can be useful for classifying the post, e.g., under a ‘Rugby’ topic.

Fig. 1: Enriching a post with metadata from hyperlinks
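The enrichment idea illustrated in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metadata store, the example URL, the book record and the function names are all hypothetical, and the lookup stands in for whatever API or Linked Data request a real system would make.

```python
import re

# Hypothetical metadata store keyed by URL. In practice this would be filled
# from a site API or a Linked Data lookup (not shown here).
OBJECT_METADATA = {
    "http://example.org/book/123": {
        "title": "An Example Rugby Autobiography",
        "author": "A. Player",
        "categories": ["Rugby", "Sports biographies"],
    },
}

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(post_text):
    """Return all http(s) URLs in the post, stripped of trailing punctuation."""
    return [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(post_text)]


def enrich_post(post_text, metadata_store, fields=("title", "categories")):
    """Append the selected metadata fields of each linked object to the post."""
    extra = []
    for url in extract_urls(post_text):
        record = metadata_store.get(url, {})
        for field in fields:
            value = record.get(field)
            if value is None:
                continue
            if isinstance(value, (list, tuple)):
                extra.extend(value)
            else:
                extra.append(value)
    return " ".join([post_text] + extra)


post = "Just finished A. Player's new book: http://example.org/book/123."
enriched = enrich_post(post, OBJECT_METADATA)
print(enriched)
```

Restricting `fields` to the title and categories mirrors the observation above that the author is already present in the post, while the title and categories carry new information.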

2. Experimental Setup

We use two datasets, one from a Forum (or message board) and one from Twitter (a microblogging site). We identified hyperlinks to sources of structured data, and retrieved the HTML pages as well as the relevant metadata. We used a Naïve Bayes classifier to compare classification based on post content (with and without URLs), hyperlinked HTML pages, external metadata from hyperlinked objects, and combinations of these. We experimented with different methods of combining sources and report the best results. Results shown are micro-averaged F1 score (± 90% confidence interval).
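To make the comparison of post representations concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier trained once on post content alone and once on content concatenated with object metadata. It is not the implementation used in the experiments: the toy posts, labels, whitespace tokeniser, add-one smoothing and simple concatenation of sources are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokeniser)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: (post content, metadata of the linked object, topic).
train = [
    ("great match last night", "Six Nations rugby highlights", "Sport"),
    ("what a try", "rugby union tries compilation", "Sport"),
    ("love this track", "indie rock music video", "Music"),
    ("new album is out", "album review rock band", "Music"),
]
labels = [label for _, _, label in train]

content_model = NaiveBayes().fit([c for c, _, _ in train], labels)
combined_model = NaiveBayes().fit([c + " " + m for c, m, _ in train], labels)

# A short post whose topic is only apparent from the linked object.
test_content, test_metadata = "have you seen this", "rugby highlights"
content_prediction = content_model.predict(test_content)
combined_prediction = combined_model.predict(test_content + " " + test_metadata)
print(content_prediction, combined_prediction)
```

On this toy example the post text alone contains almost no topical vocabulary, so the two representations can disagree; the metadata supplies the words that identify the topic.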

3. Experimental Results

Table 1 shows results for both datasets. Classification based on HTML pages alone gives poor results. Classification based on metadata alone increases F1 for Forum but decreases F1 for Twitter. Combining content and HTML pages improves F1, and combining content and metadata gives even better results.

Data Source          Forum            Twitter
Content (baseline)   0.811 ± 0.008    0.759 ± 0.015
HTML                 0.730 ± 0.007    0.645 ± 0.020
Metadata             0.835 ± 0.009    0.683 ± 0.018
Content+HTML         0.832 ± 0.007    0.784 ± 0.016
Content+Metadata     0.899 ± 0.005    0.820 ± 0.013

Table 1: F1 score for each post representation
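The scores in Table 1 are micro-averaged F1 values. As a reminder of what that measure computes, here is a small sketch that pools true positives, false positives and false negatives over all posts before taking precision and recall. The function name and the toy labels are illustrative, and representing each post's categories as a set is an assumption, since the label format is not described here; the confidence interval computation is not shown.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1; each element is the set of categories of one post."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # categories correctly predicted
        fp += len(pred - truth)   # predicted but not correct
        fn += len(truth - pred)   # correct but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: four posts, one of them misclassified.
truth = [{"Sport"}, {"Music"}, {"Sport"}, {"Film"}]
predicted = [{"Sport"}, {"Sport"}, {"Sport"}, {"Film"}]
score = micro_f1(truth, predicted)
print(round(score, 3))
```

When every post has exactly one true and one predicted category, pooling the counts this way makes micro-averaged F1 equal to plain accuracy.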

Table 2 shows the results of classification based on individual metadata types in Forum compared to post content. For posts that link to Wikipedia, the article descriptions and categories provide a better indicator of the post topic than the post itself. For posts that link to YouTube, the video title, description and tags provide better indicators of the post topic than the original post.

Data Source           Wikipedia        YouTube
Content (baseline)    0.761 ± 0.014    0.709 ± 0.011
Object Titles         0.685 ± 0.016    0.773 ± 0.015
Object Descriptions   0.798 ± 0.016    0.752 ± 0.010
Object Categories     0.811 ± 0.012    0.514 ± 0.017
Object Tags           N/A              0.838 ± 0.019

Table 2: F1 score for metadata types in Forum

4. Conclusion

Our results show that categorisation in social media can be significantly improved by including metadata from hyperlinked objects. Different metadata types vary in their usefulness for post classification, and some types of object metadata are more useful for classification than the actual content of the post. We conclude that hyperlinks to structured data sources, where specific metadata can be identified and extracted, are a valuable input for post categorisation.

5. References

[1] S. Kinsella, M. Wang, J.G. Breslin and C. Hayes, ‘Improving Categorisation in Social Media using Hyperlinks to Structured Data Sources’, Proceedings of ESWC 2011.
