NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Reconstruction of Threads in Internet Forums<br />
Erik Aumayr, Jeffrey Chan and Conor Hayes<br />
Digital Enterprise Research Institute, NUI Galway, Ireland<br />
{erik.aumayr, jkc.chan, conor.hayes}@deri.org<br />
Abstract<br />
Online discussion boards, or Internet forums, are a<br />
significant part of the Internet. People use Internet<br />
forums to post questions, provide advice and<br />
participate in discussions. These online conversations<br />
are represented as threads, and the conversation trees<br />
within these threads are important in understanding the<br />
behaviour of online users. Unfortunately, the reply<br />
structures of these threads are generally not publicly<br />
accessible or not maintained. Hence, we introduce an<br />
efficient and simple approach to reconstruct the reply<br />
structure in threaded conversations. We compare its<br />
accuracy against an existing algorithm and a baseline.<br />
1. Introduction<br />
Internet forums are an important part of the web, where<br />
questions are asked and answered and public discussions<br />
take place on all types of topics. In forums, conversations<br />
are represented as sequences of posts, or threads,<br />
where the posts are replies to one or more earlier posts.<br />
Links exist between posts if one is the direct reply to<br />
another. However, the reply structure of threads is not<br />
always available. For instance, the provider may not<br />
maintain the structure, or it may have been lost. We propose a new<br />
method to reconstruct the reply structure of posts in<br />
forums, using a set of simple features and a decision<br />
tree classifier. We evaluate the accuracy of the algorithm<br />
against an existing approach and a heuristic baseline.<br />
2. Methodology<br />
Definitions A post in a thread provides us with the<br />
following basic information: creation date, name of the<br />
author, name of the quoted author (if the post quotes<br />
another post) and content. The creation date of posts establishes a<br />
chronological order. From that ordering we can<br />
compute the distance of one post to another. Distance<br />
means how far a reply is from the post it answers. If there is no<br />
other post between a post and its reply, then they have a<br />
post distance of 1. If there is one post in between,<br />
then the distance is 2, and so forth.<br />
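The post distance defined above can be illustrated with a minimal Python sketch. The function name `post_distance` and the sample timestamps are hypothetical and chosen only for illustration; the sketch assumes each post's creation date is available as a `datetime`.

```python
from datetime import datetime

def post_distance(creation_dates, reply_index, target_index):
    """Return the post distance between a reply and the post it answers.

    creation_dates: one datetime per post in the thread (any order).
    reply_index / target_index: positions of the two posts in that list.
    """
    # Rank posts chronologically; sorted() is stable, so ties keep their order.
    order = sorted(range(len(creation_dates)), key=lambda i: creation_dates[i])
    rank = {post: position for position, post in enumerate(order)}
    distance = rank[reply_index] - rank[target_index]
    if distance <= 0:
        raise ValueError("a reply must be created after the post it answers")
    return distance

# Hypothetical three-post thread.
dates = [
    datetime(2010, 5, 1, 12, 0),   # post 0: thread starter
    datetime(2010, 5, 1, 12, 7),   # post 1
    datetime(2010, 5, 1, 12, 30),  # post 2
]
print(post_distance(dates, 1, 0))  # no post in between: distance 1
print(post_distance(dates, 2, 0))  # one post in between: distance 2
```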
Note that the data we use stores reply interactions<br />
in such a way that each post can reply to only one other<br />
post. Although a user can reply to several posts<br />
at once, and our approach is able to return more than<br />
one reply candidate, we limit replies to one target post<br />
in our evaluation.<br />
Baseline approaches In our data, we found that<br />
79.7% of the replies have a post distance of 1, i.e. they<br />
directly follow the post they refer to. Hence, our first<br />
baseline approach, called “1-Distance Linking”, links<br />
each post to its immediate predecessor.<br />
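The 1-Distance Linking baseline is simple enough to sketch in a few lines of Python. The function name `one_distance_linking` and the post identifiers are hypothetical; the sketch assumes the posts are already in chronological order.

```python
def one_distance_linking(post_ids):
    """Link every post except the first to its immediate predecessor.

    post_ids: post identifiers in chronological order.
    Returns a dict mapping each reply to its predicted parent post.
    """
    return {reply: parent for parent, reply in zip(post_ids, post_ids[1:])}

# Hypothetical four-post thread.
thread = ["p1", "p2", "p3", "p4"]
links = one_distance_linking(thread)
print(links)  # {'p2': 'p1', 'p3': 'p2', 'p4': 'p3'}
```

The thread starter has no predecessor, so it receives no link.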
Wang et al. [1] introduced a thread reconstruction<br />
method that relies on content similarity and post<br />
distance. It serves as our second baseline approach.<br />
Features Based on the information a pair of posts<br />
provides, we extract the following features for our<br />
classification task: reply distance, posting-time<br />
difference, author quoted and cosine similarity. The<br />
cosine similarity compares the contents of two posts and<br />
returns a similarity score from 0 to 1, where 0 means not<br />
similar and 1 means exactly equal.<br />
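As a rough illustration of the four features, the following Python sketch builds a feature record for one pair of posts. The `Post` fields, the function names, the word-count vectors and the use of seconds for the time difference are assumptions made for this example; the text does not specify tokenisation or term weighting.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Post:
    created: datetime
    author: str
    quoted_author: Optional[str]  # None if the post quotes nobody
    content: str

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (range 0 to 1)."""
    # Assumption: lowercase word counts serve as the vector representation.
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def pair_features(candidate, reply, distance):
    """Features for the pair (candidate parent, reply)."""
    return {
        "reply_distance": distance,
        "time_difference_s": (reply.created - candidate.created).total_seconds(),
        "author_quoted": reply.quoted_author == candidate.author,
        "cosine_similarity": cosine_similarity(candidate.content, reply.content),
    }

# Hypothetical pair of posts.
first = Post(datetime(2010, 5, 1, 12, 0), "alice", None, "Which laptop should I buy?")
second = Post(datetime(2010, 5, 1, 12, 7), "bob", "alice", "Buy the laptop with more RAM.")
features = pair_features(first, second, distance=1)
print(features)
```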
Classifier As a classifier, we investigate the widely<br />
used C4.5 decision tree algorithm. Owing to its relative<br />
simplicity, it handles large amounts of data efficiently,<br />
which is important for our goal of presenting a<br />
fast and efficient way of reconstructing threads.<br />
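C4.5 chooses the attribute to split on by gain ratio, that is, information gain normalised by the split information. The Python sketch below computes this criterion for a single discrete feature. It is not a full C4.5 implementation (no thresholds for continuous features, no pruning), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Gain ratio of splitting `labels` by a discrete feature."""
    total = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(feature_values)
    if split_information == 0:
        return 0.0  # the feature has a single value and cannot split the data
    return information_gain / split_information

# Toy data: does "author quoted" predict whether a pair is a true reply link?
author_quoted = [True, True, False, False]
is_reply = [True, True, False, False]
print(gain_ratio(author_quoted, is_reply))  # perfectly informative feature
```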
3. Evaluation<br />
For the evaluation we use a subset of our Boards.ie<br />
dataset, namely 13,100 threads consisting of 133,200<br />
posts in total.<br />
In order to compare the classification results of the<br />
approaches, we use precision, recall<br />
and F-score, where the F-score is the harmonic mean of<br />
precision and recall. For training the classifier, we<br />
applied 10-fold cross-validation to minimise bias.<br />
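The three measures can be sketched in Python as follows, assuming that predicted and actual reply links are compared as (reply, parent) pairs. The function name and the sample links are hypothetical.

```python
def precision_recall_f(predicted_links, true_links):
    """Precision, recall and F-score for sets of (reply, parent) links."""
    predicted = set(predicted_links)
    actual = set(true_links)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical thread: p3 actually replies to p1, not to p2.
true_links = {("p2", "p1"), ("p3", "p1"), ("p4", "p3")}
predicted_links = {("p2", "p1"), ("p3", "p2"), ("p4", "p3")}
precision, recall, f_score = precision_recall_f(predicted_links, true_links)
print(precision, recall, f_score)
```

Here two of the three predicted links are correct, so precision, recall and F-score all equal 2/3.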
Table 1 shows the comparison between our<br />
classification algorithm “ThreadRecon” and the two<br />
baseline approaches.<br />
Approach | F-score<br />
Wang et al. 2008 | 44.40%<br />
1-Distance Linking | 79.70%<br />
ThreadRecon | 85.70%<br />
Table 1: F-score comparison between ThreadRecon and<br />
baseline approaches<br />
An extended version of this work will be published in<br />
“Reconstruction of Threaded Conversations in Online<br />
Discussion Forums”, International Conference on Weblogs<br />
and Social Media 2011.<br />
4. References<br />
[1] Wang, Joshi, Cohen and Rosé. 2008. Recovering implicit<br />
thread structure in newsgroup style conversations. In<br />
Proceedings of the 2nd International Conference on Weblogs<br />
and Social Media (ICWSM II), 152–160.