29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Reconstruction of Threads in Internet Forums

Erik Aumayr, Jeffrey Chan and Conor Hayes

Digital Enterprise Research Institute, NUI Galway, Ireland

{erik.aumayr, jkc.chan, conor.hayes}@deri.org

Abstract

Online discussion boards, or Internet forums, are a significant part of the Internet. People use them to post questions, provide advice and participate in discussions. These online conversations are represented as threads, and the conversation trees within these threads are important for understanding the behaviour of online users. Unfortunately, the reply structures of these threads are generally either not publicly accessible or not maintained. Hence, we introduce a simple and efficient approach to reconstructing the reply structure of threaded conversations. We compare its accuracy against an existing algorithm and a baseline algorithm.

1. Introduction

Internet forums are an important part of the web, where questions are asked and answered and topics of all kinds are discussed publicly. In forums, conversations are represented as threads, that is, sequences of posts in which each post replies to one or more earlier posts. A link exists between two posts if one is a direct reply to the other. However, the reply structure of threads is not always available: the provider may not maintain it, or it may have been lost. We propose a new method to reconstruct the reply structure of threads in forums, using a set of simple features and a decision tree classifier. We evaluate the accuracy of the algorithm against an existing approach and a heuristic baseline.

2. Methodology

Definitions. A post in a thread provides the following basic information: creation date, name of the author, name of the quoted author (if the post quotes another post) and content. The creation dates of the posts establish a chronological order, from which we can compute the distance between one post and another. Distance means how far a reply is from the post it answers. If there is no other post between a post and its reply, they have a post distance of 1. If there is one post in between, the distance is 2, and so forth.
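
As a minimal sketch of this definition, the Python snippet below orders a thread by creation date and counts the positions between a reply and the post it answers. The `Post` record, its field names and the sample thread are hypothetical and are not taken from the Boards.ie data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Post:
    # Hypothetical record; the field names are illustrative assumptions.
    post_id: str
    author: str
    created: datetime


def post_distance(thread: List[Post], reply_id: str, target_id: str) -> int:
    """Return how many positions separate a reply from its target post
    in the chronological ordering of the thread."""
    ordered = sorted(thread, key=lambda p: p.created)
    position = {p.post_id: i for i, p in enumerate(ordered)}
    return position[reply_id] - position[target_id]


thread = [
    Post("a", "alice", datetime(2011, 1, 1, 10, 0)),
    Post("b", "bob", datetime(2011, 1, 1, 10, 5)),
    Post("c", "carol", datetime(2011, 1, 1, 10, 9)),
]
print(post_distance(thread, "b", "a"))  # 1: no post in between
print(post_distance(thread, "c", "a"))  # 2: one post in between
```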

Note that our data stores the reply interaction such that each post replies to only one other post. Although a user can reply to several posts at once, and our approach is able to return more than one reply candidate, we limit replies to one target post in our evaluation.

Baseline approaches. In our data, we found that 79.7% of the replies have a post distance of 1, i.e. they directly follow the post they refer to. Hence, our first baseline approach, called “1-Distance Linking”, links each post to its immediate predecessor.
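
The heuristic is simple enough to sketch in a few lines of Python. The function name and the use of plain string identifiers are assumptions made for illustration; the input is assumed to be already sorted by creation date.

```python
from typing import Dict, List, Optional


def one_distance_linking(post_ids: List[str]) -> Dict[str, Optional[str]]:
    """Link each post to its immediate predecessor.

    `post_ids` must already be in chronological order. The first post
    opens the thread and therefore has no reply target.
    """
    links: Dict[str, Optional[str]] = {}
    previous: Optional[str] = None
    for post_id in post_ids:
        links[post_id] = previous
        previous = post_id
    return links


links = one_distance_linking(["p1", "p2", "p3"])
print(links)  # {'p1': None, 'p2': 'p1', 'p3': 'p2'}
```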


Wang et al. 2008 [1] introduced a thread reconstruction approach that relies on content similarity and post distance. It serves as our second baseline approach.

Features. Based on the information a pair of posts provides, we extract the following features for our classification task: reply distance, posting-time difference, author quoted and cosine similarity. The cosine similarity compares the contents of two posts and returns a score from 0 to 1, where 0 means no similarity and 1 means identical content.
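
To illustrate the cosine similarity feature, here is a small Python sketch. It assumes lower-cased whitespace tokens and raw term counts as the vector representation; the text above does not state which tokenisation or term weighting was actually used, so treat both as placeholders.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the raw term-count vectors of two texts.

    Assumption: whitespace tokenisation and unweighted counts.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty post shares nothing with any other post
    return dot / (norm_a * norm_b)


print(cosine_similarity("which router should I buy", "buy the cheaper router"))
print(cosine_similarity("which router should I buy", "lovely weather today"))
```

Because term counts are never negative, the score always lies between 0 and 1. Under this bag-of-words view word order is ignored, so two posts containing the same words in a different order also score 1.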

Classifier. As a classifier, we investigate the widely used C4.5 decision tree algorithm. Owing to its relative simplicity, it handles large amounts of data very efficiently, which is important because our aim is a fast and efficient way of reconstructing threads.

3. Evaluation

For the evaluation we use a subset of our Boards.ie dataset, namely 13,100 threads consisting of 133,200 posts in total.

To compare the classification results of the approaches, we use precision, recall and F-score, where F-score is the harmonic mean of precision and recall. For training the classifier, we applied 10-fold cross-validation to minimise bias.
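
The three measures can be sketched in Python as follows. The representation of a link as a `(reply, target)` pair and the sample links are assumptions for illustration; a predicted link is counted as correct only if it matches a recorded reply exactly.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (reply post id, target post id), hypothetical format


def precision_recall_f(predicted: Set[Link], actual: Set[Link]) -> Tuple[float, float, float]:
    """Return precision, recall and F-score for a set of predicted links."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


predicted = {("b", "a"), ("c", "b")}
actual = {("b", "a"), ("c", "a"), ("d", "a"), ("e", "d")}
precision, recall, f_score = precision_recall_f(predicted, actual)
print(precision, recall, f_score)  # precision 0.5, recall 0.25, F-score about 0.333
```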

Table 1 shows the comparison between our classification algorithm “ThreadRecon” and the two baseline approaches.

Approach              F-score
Wang et al. 2008      44.40%
1-Distance Linking    79.70%
ThreadRecon           85.70%

Table 1: F-score comparison between ThreadRecon and baseline approaches

An extended version of this work will be published as “Reconstruction of Threaded Conversations in Online Discussion Forums” at the International Conference on Weblogs and Social Media 2011.

4. References

[1] Wang, Joshi, Cohen and Rosé. 2008. Recovering implicit thread structure in newsgroup style conversations. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM II), 152–160.
