28.02.2013 Views

3rd meeting of young researchers at UP 1 - IJUP - Universidade do ...

Proximity Analysis among Researchers at UP and some Institutions using Bibliographic Databases

L. Trigo 1

1 Faculty of Economics, University of Porto, Portugal.

Before starting a particular investigation, a researcher often wants to know which other researchers share the same thematic concerns. Large research institutions, such as UP, face the problem that finding people with similar interests can be a complex task.

The objective of this work is to automate the construction of a proximity matrix among researchers from UP and the INESC Porto research unit. The matrix can easily be transformed into a graph representation. In the first phase, when our concern was to develop a working prototype, we focused on a few researchers from LIAAD-INESC Porto only. This phase has been completed, and some results are shown below for illustration. In the next phase we will extend the set of researchers to other institutions, including the members of some units of INESC Porto and/or some other R&D unit associated with UP (e.g. CRACS). We will then gradually extend the prototype to more institutions and people.

The proximity matrix is constructed by applying existing text mining techniques to the items found in bibliographic databases. The prototype was developed in R, whose “tm” package facilitates this work, as it contains many useful functions that can be exploited. The first task is to collect names from the institution's web site (e.g. the web site of LIAAD). In the next step we collect the list of publication titles for each person. These can be found, for instance, in the bibliographic database of the Digital Bibliography & Library Project (DBLP).
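
As an illustration of collecting titles from a downloaded page, the following is a minimal sketch in Python (not the R prototype described here). It assumes, purely for the example, that each title sits inside a `<span class="title">` element; the sample page, the class name and the helper names are hypothetical, and real pages may be marked up differently.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside <span class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._span_depth = 0   # > 0 while we are inside a title span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._span_depth > 0:
            self._span_depth += 1          # nested span inside a title
        else:
            classes = (dict(attrs).get("class") or "").split()
            if "title" in classes:
                self._span_depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth > 0:
            self._span_depth -= 1
            if self._span_depth == 0:
                self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._span_depth > 0:
            self._buffer.append(data)


def extract_titles(html):
    """Return the list of paper titles found in one HTML page."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical page fragment standing in for a downloaded file.
SAMPLE_PAGE = """
<ul>
  <li><span class="authors">A. Author</span>
      <span class="title">Mining <i>data</i> streams.</span></li>
  <li><span class="title">Learning from text.</span></li>
</ul>
"""

for title in extract_titles(SAMPLE_PAGE):
    print(title)
```

The titles returned for one researcher would then be concatenated into that researcher's "document".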

After downloading the page of each researcher, we process the HTML files to retrieve the paper titles. The resulting document collection represents our target corpus. Each “document” characterizes a particular researcher: it includes all words that appear in his/her publication titles, together with their frequencies. The document collection is then preprocessed by applying stemming and removing stop words, extra spaces, punctuation and numbers. The result is called a document-term matrix, whose rows correspond to individual researchers. In the next step, a cosine measure of proximity / distance can be calculated for any two rows (researchers). The result can be represented in the form of a matrix showing all proximity figures. An example of a dissimilarity matrix obtained with our prototype is shown below.
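
The preprocessing and cosine steps in this paragraph can be sketched in a few lines. The sketch below is in Python with the standard library only, not the R "tm" code of the prototype; the researcher names, titles and stop-word list are made up for illustration, stemming is omitted for brevity, and the dissimilarity is taken as one minus the cosine similarity, which is an assumption about how the figures were computed.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run needs a fuller one).
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lower-case the text, keep ASCII letter runs only (this drops
    punctuation, numbers and extra spaces), then remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_term_matrix(docs):
    """docs maps researcher -> list of titles.
    Returns (names, vocabulary, rows), one row of term counts per researcher."""
    counts = {name: Counter(tokenize(" ".join(titles)))
              for name, titles in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    names = list(docs)
    rows = [[counts[name][term] for term in vocab] for name in names]
    return names, vocab, rows


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dissimilarity_matrix(rows):
    """Pairwise dissimilarities between all rows (researchers)."""
    return [[cosine_dissimilarity(u, v) for v in rows] for u in rows]


# Hypothetical toy corpus: one "document" (list of titles) per researcher.
DOCS = {
    "Researcher A": ["Learning from data streams", "Mining data streams"],
    "Researcher B": ["Mining text streams"],
    "Researcher C": ["Graph drawing algorithms"],
}

names, vocab, rows = document_term_matrix(DOCS)
matrix = dissimilarity_matrix(rows)
for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row])
```

In this toy corpus, researchers A and C share no terms, so their dissimilarity is 1, while A and B share "mining" and "streams" and end up closer.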

                      #1     #2     #3     #4     #5     #6
1 – J. Gama           0      0.573  0.959  0.529  0.538  0.595
2 – P. Brazdil        0.573  0      0.981  0.672  0.705  0.623
3 – A. M. Jorge       0.959  0.981  0      0.98   0.893  1
4 – L. Torgo          0.529  0.672  0.98   0      0.565  0.676
5 – C. Soares         0.538  0.705  0.893  0.565  0      0.816
6 – J. F. Gonçalves   0.595  0.623  1      0.676  0.816  0

Since the matrix holds dissimilarities, lower values indicate more closely related work. For instance, the work most closely related to that of J. Gama is that of L. Torgo (0.529), followed by that of C. Soares (0.538).
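
The reading of the matrix given above, and its transformation into a graph, can be sketched as follows. This is a Python illustration using the values from the example matrix; the function names and the 0.6 cut-off for drawing an edge are assumptions made for the sketch, not part of the prototype.

```python
NAMES = ["J. Gama", "P. Brazdil", "A. M. Jorge",
         "L. Torgo", "C. Soares", "J. F. Gonçalves"]

# Dissimilarity values copied from the example matrix.
D = [
    [0,     0.573, 0.959, 0.529, 0.538, 0.595],
    [0.573, 0,     0.981, 0.672, 0.705, 0.623],
    [0.959, 0.981, 0,     0.98,  0.893, 1],
    [0.529, 0.672, 0.98,  0,     0.565, 0.676],
    [0.538, 0.705, 0.893, 0.565, 0,     0.816],
    [0.595, 0.623, 1,     0.676, 0.816, 0],
]


def nearest(name, k=2):
    """Return the k researchers with the lowest dissimilarity to `name`."""
    i = NAMES.index(name)
    others = sorted((D[i][j], NAMES[j]) for j in range(len(NAMES)) if j != i)
    return [(other, dist) for dist, other in others[:k]]


def edges(threshold):
    """Graph edges: pairs whose dissimilarity is at or below the threshold.
    The threshold is a hypothetical parameter chosen by the user."""
    n = len(NAMES)
    return [(NAMES[i], NAMES[j], D[i][j])
            for i in range(n) for j in range(i + 1, n)
            if D[i][j] <= threshold]


print(nearest("J. Gama"))
for a, b, dist in edges(0.6):
    print(f"{a} -- {b} ({dist})")
```

With a cut-off of this kind, a researcher whose dissimilarities all lie above the threshold appears as an isolated node in the graph.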

References<br />

[1] R. Feldman, J. Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2007.

Acknowledgements: I wish to express my gratitude to Prof. P. Brazdil (LIAAD / FEP) for suggesting this problem and for his supervision.
