NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Improving Categorisation in Social Media using Hyperlinked Object Metadata<br />
Sheila Kinsella<br />
Digital Enterprise Research Institute, National University of Ireland, Galway<br />
sheila.kinsella@deri.org<br />
Abstract<br />
Categorising social media posts is challenging, since<br />
they are often short, informal, and rely on external<br />
hyperlinks for context. We investigate the potential of<br />
external hyperlinks for classifying the topic of posts. We<br />
focus on objects with related structured data available.<br />
We show that including metadata from hyperlinked<br />
objects significantly improves classifier performance.<br />
We use the structure of the data to compare the effects<br />
of different metadata types on categorisation.<br />
1. Introduction<br />
Hyperlinks are often a vital part of online conversation.<br />
Users share videos or photos they have seen and point<br />
to products or movies they are interested in. These<br />
external resources can provide useful new data such as<br />
author information for books, or genre information for<br />
movies. Often the post cannot be fully understood<br />
without knowing these details. Many of the hyperlinks<br />
point to websites that provide metadata for objects, e.g.,<br />
videos (YouTube) or products (Amazon), and publish<br />
this data in a structured format via an API or as Linked<br />
Data. Structured data is useful since it allows relevant<br />
data types to be identified. Fig. 1 gives an example of a<br />
post where useful information can be gleaned from the<br />
metadata of a hyperlinked object. Some data such as the<br />
author is redundant, but the book title and categories<br />
are new. The title and categories can be useful for<br />
classifying the post, e.g., under a ‘Rugby’ topic.<br />
Fig. 1: Enriching a post with metadata from hyperlinks<br />
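The enrichment step described above can be illustrated with a minimal sketch.<br />
It assumes a hypothetical in-memory `METADATA` table standing in for an API or<br />
Linked Data lookup; the URL, field names and values are invented for illustration<br />
and are not taken from the datasets used in this work.<br />

```python
import re

# Hypothetical metadata store standing in for an API or Linked Data lookup.
# The URL and all field values are invented for illustration.
METADATA = {
    "http://example.org/book/123": {
        "title": "An Example Rugby Autobiography",
        "author": "A. Player",
        "categories": ["Rugby", "Sports biographies"],
    },
}

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(post):
    """Return all http(s) URLs in the post, minus trailing punctuation."""
    return [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(post)]


def enrich(post, fields=("title", "categories")):
    """Append the selected metadata fields of each linked object to the post."""
    extra = []
    for url in extract_urls(post):
        record = METADATA.get(url, {})
        for field in fields:
            value = record.get(field)
            if value is None:
                continue
            # List-valued fields (e.g. categories) contribute one entry each.
            extra.extend(value if isinstance(value, list) else [value])
    return " ".join([post] + extra)


post = "Just finished A. Player's new book http://example.org/book/123."
enriched = enrich(post)
```

Selecting fields by name is what the structured format makes possible: the redundant<br />
author field is skipped here, while the title and categories are added to the post.<br />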
2. Experimental Setup<br />
We used two datasets, one from a Forum (or message<br />
board) and one from Twitter (a microblogging site).<br />
We identified hyperlinks to sources of structured data,<br />
and retrieved the HTML pages as well as the relevant<br />
metadata. We used a Naïve Bayes classifier to compare<br />
classification based on post content (with and without<br />
URLs), hyperlinked HTML pages, external metadata<br />
from hyperlinked objects, and combinations of these.<br />
We experimented with different methods of combining<br />
sources and report the best results. All results are<br />
micro-averaged F1 scores (± 90% confidence interval).<br />
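As a rough illustration of this setup, the sketch below trains a small multinomial<br />
Naïve Bayes classifier on post content combined with metadata. The combination<br />
method shown (concatenating tokens and prefixing metadata tokens with `meta:`) is<br />
only one plausible choice and is an assumption, as are the class name `NaiveBayes`,<br />
the helper names, the smoothing parameter and the toy training posts.<br />

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenisation (a simplifying assumption)."""
    return text.lower().split()


def combine(content, metadata):
    """Concatenate both sources, marking metadata tokens with a prefix."""
    return tokenize(content) + ["meta:" + t for t in tokenize(metadata)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: (post content, metadata of linked object, topic).
train = [
    ("great match last night", "rugby union six nations", "Rugby"),
    ("what a try", "rugby highlights", "Rugby"),
    ("love this song", "rock music live concert", "Music"),
    ("new album is out", "album music review", "Music"),
]
docs = [combine(content, meta) for content, meta, _ in train]
labels = [topic for _, _, topic in train]
model = NaiveBayes().fit(docs, labels)

content_only = model.predict(tokenize("watch this"))
with_metadata = model.predict(combine("watch this", "rugby match highlights"))
```

In this toy example the short post on its own is assigned to the wrong topic, while<br />
the same post combined with the metadata of its linked object is assigned to 'Rugby'.<br />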
3. Experimental Results<br />
Table 1 shows results for both datasets. Classification<br />
based on HTML pages alone performs worse than the<br />
content baseline on both datasets. Classification based<br />
on metadata alone increases F1 over the baseline for<br />
Forum but decreases it for Twitter. Combining content<br />
and HTML pages improves F1 over the baseline, and<br />
combining content and metadata gives even better results.<br />
Data Source | Forum | Twitter<br />
Content (baseline) | 0.811 ± 0.008 | 0.759 ± 0.015<br />
HTML | 0.730 ± 0.007 | 0.645 ± 0.020<br />
Metadata | 0.835 ± 0.009 | 0.683 ± 0.018<br />
Content+HTML | 0.832 ± 0.007 | 0.784 ± 0.016<br />
Content+Metadata | 0.899 ± 0.005 | 0.820 ± 0.013<br />
Table 1: F1 score for each post representation<br />
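For readers unfamiliar with the metric in Table 1, the sketch below shows one way<br />
to compute a micro-averaged F1 score and a 90% interval around a mean score. How<br />
the intervals in this work were obtained is not stated here, so the normal<br />
approximation (z = 1.645) over per-fold scores is an assumption, and the labels<br />
and fold scores are invented for illustration.<br />

```python
import math
from statistics import mean, stdev


def micro_f1(gold, predicted):
    """Micro-averaged F1 over lists of label sets (one set per post)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def interval_half_width_90(scores):
    """Half-width of a 90% interval for the mean (normal approximation)."""
    return 1.645 * stdev(scores) / math.sqrt(len(scores))


# Invented example: four posts, one of them misclassified.
gold = [{"Rugby"}, {"Music"}, {"Music"}, {"Film"}]
predicted = [{"Rugby"}, {"Music"}, {"Film"}, {"Film"}]
score = micro_f1(gold, predicted)

# Invented per-fold scores, e.g. from cross-validation.
fold_scores = [0.80, 0.82, 0.81, 0.79, 0.83]
summary = (mean(fold_scores), interval_half_width_90(fold_scores))
```

Micro-averaging pools true positives, false positives and false negatives over all<br />
posts before computing precision and recall, so frequent topics weigh more than rare ones.<br />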
Table 2 shows the results of classification based on<br />
individual metadata types in Forum compared to post<br />
content. For posts that link to Wikipedia, the article<br />
descriptions and categories provide a better indicator of<br />
the post topic than the post itself. For posts that link to<br />
YouTube, the video title, description and tags provide<br />
better indicators of the post topic than the original post.<br />
Data Source | Wikipedia | YouTube<br />
Content (baseline) | 0.761 ± 0.014 | 0.709 ± 0.011<br />
Object Titles | 0.685 ± 0.016 | 0.773 ± 0.015<br />
Object Descriptions | 0.798 ± 0.016 | 0.752 ± 0.010<br />
Object Categories | 0.811 ± 0.012 | 0.514 ± 0.017<br />
Object Tags | N/A | 0.838 ± 0.019<br />
Table 2: F1 score for metadata types in Forum<br />
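The comparison in Table 2 can be checked with a short sketch that lists the<br />
metadata types whose interval lies entirely above that of the content baseline.<br />
The values are copied from Table 2; treating non-overlapping intervals as a clear<br />
improvement is a conservative heuristic assumed here, not a formal significance<br />
test, and the function names are illustrative.<br />

```python
# F1 scores and 90% interval half-widths copied from Table 2.
TABLE_2 = {
    "Wikipedia": {
        "Content (baseline)": (0.761, 0.014),
        "Object Titles": (0.685, 0.016),
        "Object Descriptions": (0.798, 0.016),
        "Object Categories": (0.811, 0.012),
    },
    "YouTube": {
        "Content (baseline)": (0.709, 0.011),
        "Object Titles": (0.773, 0.015),
        "Object Descriptions": (0.752, 0.010),
        "Object Categories": (0.514, 0.017),
        "Object Tags": (0.838, 0.019),
    },
}


def clearly_better(candidate, baseline):
    """True if the candidate's interval lies entirely above the baseline's."""
    (c_score, c_half), (b_score, b_half) = candidate, baseline
    return c_score - c_half > b_score + b_half


def better_than_baseline(site):
    """Metadata types that clearly outperform post content for one site."""
    rows = TABLE_2[site]
    baseline = rows["Content (baseline)"]
    return [
        name
        for name, row in rows.items()
        if name != "Content (baseline)" and clearly_better(row, baseline)
    ]


wikipedia_winners = better_than_baseline("Wikipedia")
youtube_winners = better_than_baseline("YouTube")
```

Under this heuristic, descriptions and categories stand out for Wikipedia, and titles,<br />
descriptions and tags stand out for YouTube, in line with the discussion of Table 2.<br />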
4. Conclusion<br />
Our results show that categorisation in social media can<br />
be significantly improved by including metadata from<br />
hyperlinked objects. Different metadata types vary in<br />
their usefulness for post classification, and some types<br />
of object metadata are more useful for classification<br />
than the actual content of the post. We conclude that<br />
hyperlinks to structured data sources, where specific<br />
metadata can be identified and extracted, are a valuable<br />
input for post categorisation.<br />
5. References<br />
[1] S. Kinsella, M. Wang, J.G. Breslin and C. Hayes,<br />
‘Improving Categorisation in Social Media using Hyperlinks<br />
to Structured Data Sources’, Proceedings of ESWC 2011.