29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Personalised Document Corpora Exploration<br />

Abstract<br />

In our research, we explore ways of enhancing<br />

personalised retrieval of relevant documents in open<br />

text document corpora. While a lot of research has been<br />

done in document retrieval, most approaches look at<br />

ways of improving single-step query answering.<br />

Because the users might have more complex<br />

information needs than they can express in a short<br />

query, our scope is to research how a system can<br />

actively assist the users in their exploratory search over<br />

a sequence of steps (i.e., queries).<br />

1. Introduction<br />

The Internet has become one of the main information<br />

access points on the planet due to the great amounts of<br />

text documents it provides. Two types of information<br />

search have been identified: (i) look-up search (e.g., fact<br />

retrieval) and (ii) exploratory search (e.g., learning,<br />

investigation, analysis) [1]. Our focus is on enhancing<br />

exploratory search with machine learning techniques.<br />

Currently, there are very few approaches towards<br />

information exploration, while most research in the<br />

area of personalised search focuses on look-up<br />

search [1]. However, since many users come to a search<br />

engine for exploratory search, our research is directed<br />

towards a system which is able to assist the user in his<br />

“exploration” over an unlimited sequence of queries.<br />

An important challenge is that new information items<br />

become available every day and the system must be able<br />

to accommodate any dynamic corpus of documents in<br />

an unsupervised fashion. Moreover, different users need<br />

different information even when they ask the same<br />

query, so user modeling is crucial for our work.<br />

2. Our Approach<br />

2.1 Domain modeling<br />

In order to be able to guide the user towards<br />

information items of interest, the system must keep a<br />

“browsable” domain model, and constantly map the<br />

user’s state to that model.<br />

In our approach, the text documents are first<br />

processed for extraction of probabilistic topic models.<br />

Latent Dirichlet Allocation (LDA) is a well-known<br />

algorithm for this purpose which we explore in this<br />

setting. The obtained topics are then used to label and<br />

cluster the documents. Many algorithms have been<br />

researched for document clustering. Still, one challenge<br />

is to identify for each user the level of granularity he can<br />

internalise or is interested in. For example, the level of<br />

detail an expert feels comfortable with is not the same<br />

as that of a beginner in the same subject.<br />
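As a rough illustration of the topic extraction and labelling step, the sketch below runs a small collapsed Gibbs sampler for LDA and labels each document with its dominant topic. The function name `lda_gibbs`, the toy corpus, and the hyperparameter values are assumptions made for this example only; they are not taken from the system described here, and a real corpus would call for an established LDA implementation.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of token lists. Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the current assignment of this token from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


# Hypothetical, already tokenised toy corpus.
docs = [
    ["topic", "model", "lda", "topic", "inference"],
    ["lda", "topic", "model", "dirichlet"],
    ["citation", "network", "community", "graph"],
    ["community", "graph", "citation", "author"],
]
theta = lda_gibbs(docs, num_topics=2)
# Label (and thereby cluster) each document by its most probable topic.
labels = [max(range(len(row)), key=row.__getitem__) for row in theta]
```

Grouping documents by dominant topic is the simplest possible clustering; the granularity issue raised above could be approached by varying `num_topics` per user, which this sketch does not attempt.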

Ioana Hulpuș, Conor Hayes<br />

DERI, National University of Ireland, Galway<br />

{name.surname@deri.org}<br />


Documents are also connected based on their<br />

citation and bibliographic networks. The resulting<br />

network structure enables the navigation of documents<br />

based on authors and their citations. The bibliographic<br />

links enable the identification of communities, which is<br />

currently a highly researched topic. However, these<br />

communities are created based on citations, and at the<br />

same time, one of the most common ways of scientific<br />

data exploration is by following citation paths. Thus, by<br />

traversing only bibliographic paths, the users can get<br />

stuck in the point of view of one community. Our aim<br />

is to use the available research in the direction of<br />

community detection and make the user aware of the<br />

different communities dealing with his topics of interest.<br />
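One way to picture this idea is sketched below: communities are detected on a citation graph with a simple label propagation scheme, and the communities that cover a given topic are then listed so that the user can see more than one point of view. Label propagation is only one of many community detection algorithms and is used here as a stand-in; the function names, the toy citation edges, and the `doc_topics` mapping are hypothetical.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_rounds=100):
    """Assign a community label to each node of an undirected graph.

    Each node repeatedly adopts the most frequent label among its
    neighbours; ties are broken by the smallest label so that the
    result is deterministic.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {node: node for node in neighbours}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def communities_for_topic(labels, doc_topics, topic):
    """Return the sorted community labels that contain a document on `topic`."""
    return sorted({labels[doc] for doc, topics in doc_topics.items()
                   if topic in topics and doc in labels})


# Hypothetical citation links: two groups of papers that only cite internally.
edges = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"),
    ("q1", "q2"), ("q2", "q3"), ("q1", "q3"),
]
# Hypothetical topic labels per document.
doc_topics = {
    "p1": {"topic models"}, "p2": {"clustering"}, "p3": {"topic models"},
    "q1": {"topic models"}, "q2": {"user modeling"}, "q3": {"clustering"},
}

labels = label_propagation(edges)
covering = communities_for_topic(labels, doc_topics, "topic models")
```

In this toy graph a user following citations from `p1` would never reach `q1`, although both deal with the same topic; listing `covering` makes the second community visible.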

2.2 Trace models and trace based reasoning<br />

This model captures the “interaction traces” between<br />

the user and the system. Trace-based reasoning (TBR)<br />

has its roots in case-based reasoning, as it solves new<br />

problems by adapting past solutions (i.e., user traces).<br />

Past traces can be used to infer the context of a user’s<br />

search, which is very important for assessing relevance<br />

of information. At the same time, machine learning can<br />

be used on the repository of traces, in order to identify<br />

patterns of information requests. The idea behind this is<br />

that if some documents or queries are constantly<br />

accessed together by various users, they might be<br />

related.<br />
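The co-access intuition can be sketched as a simple pair count over the trace repository. In the example below each trace is assumed to be the list of document identifiers one user accessed in one session; the trace format, the function name, and the support threshold of 2 are illustrative assumptions rather than part of the described system.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(traces):
    """Count, for every pair of items, the number of traces containing both."""
    counts = Counter()
    for trace in traces:
        # Deduplicate and sort so each unordered pair is counted once per trace.
        for pair in combinations(sorted(set(trace)), 2):
            counts[pair] += 1
    return counts


# Hypothetical interaction traces, one per user session.
traces = [
    ["d1", "d2", "d3"],
    ["d1", "d2"],
    ["d2", "d1", "d4"],
    ["d3", "d4"],
]

counts = co_access_counts(traces)
# Pairs accessed together in at least two traces are treated as possibly related.
related = [pair for pair, n in counts.items() if n >= 2]
```

Here only the pair `("d1", "d2")` passes the threshold. The same counting applies to queries instead of documents, and a frequent-itemset or association rule miner could replace the pair count on larger repositories.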

2.3 User modeling<br />

User modeling is closely related to trace<br />

modeling, as each user is associated with his past traces. In<br />

addition, the system will try to capture the user’s level of<br />

expertise and his topics of interest, through a mixed<br />

initiative approach.<br />

3. Conclusions<br />

Exploratory search is still an under-studied area of<br />

personalised information retrieval. In order to address<br />

this topic, we plan to use machine learning for (i)<br />

creating a browsable hierarchical structure of the<br />

domain, (ii) inferring the user’s expertise, and (iii)<br />

identifying latent document relations. This knowledge is<br />

then to be used to guide the user and recommend documents<br />

suitable to his information needs.<br />

4. References<br />

[1] Marchionini, G. (2006) Exploratory search: From<br />

finding to understanding. Communications of the<br />

ACM, 49(4), pp. 41–46.
