Comparing the value of Latent Semantic Analysis on two English-to-Indonesian lexical mapping tasks

Eliza Margaretha
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
elm40@ui.edu

Ruli Manurung
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
maruli@cs.ui.ac.id
Abstract

This paper describes an experiment that attempts to automatically map English words and concepts, derived from the Princeton WordNet, to their Indonesian analogues appearing in a widely-used Indonesian dictionary, using Latent Semantic Analysis (LSA). A bilingual semantic model is derived from an English-Indonesian parallel corpus. Given a particular word or concept, the semantic model is then used to identify its neighbours in a high-dimensional semantic space. Results from various experiments indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied. We speculate that this is because the ‘smoothing’ effect LSA has on the word-document matrix, whilst very useful for revealing implicit semantic patterns, blurs the co-occurrence information that is necessary for establishing word translations.
1 Overview

An ongoing project at the Information Retrieval Lab, Faculty of Computer Science, University of Indonesia, concerns the development of an Indonesian WordNet¹. To that end, one major task concerns the mapping of two monolingual dictionaries at two different levels: bilingual word mapping, which seeks to find translations of a lexical entry from one language to another, and bilingual concept mapping, which defines equivalence classes over concepts defined in two language resources. In other words, we try to automatically construct two variants of a bilingual dictionary between two languages, i.e. one with sense-disambiguated entries and one without.

¹ http://bahasa.cs.ui.ac.id/iwn
In this paper we present an extension of LSA to a bilingual context that is similar to (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007), and then apply it to the two mapping tasks described above, specifically towards the lexical resources of the Princeton WordNet (Fellbaum, 1998), an English semantic lexicon, and the Kamus Besar Bahasa Indonesia (KBBI)², considered by many to be the official dictionary of the Indonesian language.

We first provide formal definitions of our two tasks of bilingual word and concept mapping (Section 2) before discussing how these tasks can be automated using LSA (Section 3). We then present our experiment design and results (Sections 4 and 5), followed by an analysis and discussion of the results in Section 6.
2 Task Definitions

As mentioned above, our work concerns the mapping of two monolingual dictionaries. We refer to these resources as WordNets because we view them as semantic lexicons that index entries based on meaning. However, we do not consider other semantic relations typically associated with a WordNet, such as hypernymy, hyponymy, etc. For our purposes, a WordNet can be formally defined as a 4-tuple ⟨C, W, χ, ω⟩ as follows:

• A concept c ∈ C is a semantic entity, which represents a distinct, specific meaning. Each concept is associated with a gloss, which is a textual description of its meaning. For example, we could define two concepts, c1 and c2, where the former is associated with the gloss “a financial institution that accepts deposits and channels the money into lending activities” and the latter with “sloping land (especially the slope beside a body of water)”.

• A word w ∈ W is an orthographic entity, which represents a word in a particular language (in the case of the Princeton WordNet, English). For example, we could define two words, w1 and w2, where the former represents the orthographic string bank and the latter represents spoon.

• A word may convey several different concepts. The function χ: W → 2^C returns all concepts conveyed by a particular word. Thus χ(w), where w ∈ W, returns C_w ⊆ C, the set of all concepts that can be conveyed by w. Using the examples above, χ(w1) = {c1, c2}.

• Conversely, a concept may be conveyed by several words. The function ω: C → 2^W returns all words that can convey a particular concept. Thus ω(c), where c ∈ C, returns W_c ⊆ W, the set of all words that convey c. Using the examples above, ω(c1) = {w1}.

² The KBBI is the copyright of the Language Centre, Indonesian Ministry of National Education.
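As a concrete illustration, the definitions above can be sketched in a few lines of Python. The dictionaries below mirror the running example (c1, c2, w1, w2), but the code itself is ours, for exposition only, and is not part of the paper's method:

```python
# Illustrative sketch of the 4-tuple WordNet model: concepts C, words W,
# and the functions chi (word -> concepts) and omega (concept -> words).
# All names and data are invented to mirror the running example.

concepts = {
    "c1": "a financial institution that accepts deposits and channels "
          "the money into lending activities",
    "c2": "sloping land (especially the slope beside a body of water)",
}
words = {"w1": "bank", "w2": "spoon"}

# chi maps each word to the set of all concepts it can convey
chi = {"w1": {"c1", "c2"}, "w2": set()}

# omega, the converse mapping, can be derived mechanically from chi
omega = {}
for w, cs in chi.items():
    for c in cs:
        omega.setdefault(c, set()).add(w)

print(chi["w1"])    # both senses of "bank"
print(omega["c1"])  # words conveying the financial-institution sense
```

Note that ω falls out of χ by inverting the mapping, which is why the paper can treat the two functions as converses of each other.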
We can define different WordNets for different languages, e.g. WN_E = ⟨C_E, W_E, χ_E, ω_E⟩ and WN_I = ⟨C_I, W_I, χ_I, ω_I⟩; we use the subscript to denote the WordNet a word or concept belongs to. For the sake of our discussion, we will assume WN_E to be an English WordNet and WN_I to be an Indonesian WordNet.

If we make the assumption that concepts are language independent, the two WordNets should theoretically share the same set of universal concepts, i.e. C_E = C_I. In practice, however, we may have two WordNets with different conceptual representations, hence the distinction between C_E and C_I. We introduce the relation E ⊆ C_E × C_I to denote the explicit mapping of equivalent concepts in WN_E and WN_I.

We now describe two tasks that can be performed between WN_E and WN_I, namely bilingual concept mapping and bilingual word mapping.
The task of bilingual concept mapping is essentially the establishment of the concept equivalence relation E. For example, given the example concepts in Table 1, bilingual concept mapping seeks to establish {(c_E1, c_I1), (c_E2, c_I2), (c_E3, c_I3), ...} ⊆ E.
Concept | Word | Gloss | Example
c_E1 | time | an instance or single occasion for some event | “this time he succeeded”
c_E2 | time | a suitable moment | “it is time to go”
c_E3 | time | a reading of a point in time as given by a clock | “do you know what time it is?”
c_I1 | kali | kata untuk menyatakan salah satu waktu terjadinya peristiwa yg merupakan bagian dari rangkaian peristiwa yg pernah dan masih akan terus terjadi (a word signifying a particular instance of an ongoing series of events) | “untuk kali ini ia kena batunya” (this time he suffered for his actions)
c_I2 | waktu | saat yg tertentu untuk melakukan sesuatu (a specific time to be doing something) | “waktu makan” (eating time)
c_I3 | jam | saat tertentu, pada arloji jarumnya yg pendek menunjuk angka tertentu dan jarum panjang menunjuk angka 12 (the point in time when the short hand of a clock points to a certain hour and the long hand points to 12) | “ia bangun jam lima pagi” (she woke up at five o’clock)
c_I4 | kali | kata untuk menyatakan kekerapan tindakan (a word signifying the frequency of an event) | “dalam satu minggu ini, dia sudah empat kali datang ke rumahku” (this past week, she has come to my house four times)
c_I5 | kali | sebuah sungai yang kecil (a small river) | “air di kali itu sangat keruh” (the water in that small river is very murky)

Table 1. Sample Concepts in WN_E and WN_I
The task of bilingual word mapping is to find, given word w_E ∈ W_E, the set of all its plausible translations in W_I, regardless of the concepts being conveyed. We can also view this task as computing the union of the sets of all words in WN_I that convey the set of all concepts conveyed by w_E. Formally, we compute the set ⋃ ω_I(c_I) over all c_I ∈ C_I such that (c_E, c_I) ∈ E and c_E ∈ χ_E(w_E).

For example, in the Princeton WordNet, given w_E = time (i.e. the English orthographic form time), χ_E(w_E) returns more than 15 different concepts, among others c_E1, c_E2, and c_E3 (see Table 1). In Indonesian, assuming the relation E as defined above, the set of words that convey c_I1, i.e. ω_I(c_I1), includes kali (as in “kali ini dia berhasil” = “this time she succeeded”). On the other hand, ω_I(c_I2) may include waktu (as in “ini waktunya untuk pergi” = “it is time to go”) and saat (as in “sekarang saatnya menjual saham” = “now is the time to sell shares”), and lastly, ω_I(c_I3) may include jam (as in “apa anda tahu jam berapa sekarang?” = “do you know what time it is now?”).

Thus, the bilingual word mapping task seeks to compute, for the English word time, the set of Indonesian words {kali, waktu, saat, jam, ...}. Note that each of these Indonesian words may convey different concepts, e.g. χ_I(kali) may also include c_I5 in Table 1.
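The word-mapping computation just described can be sketched as follows. The toy χ_E, E, and ω_I below are a hand-made fragment inspired by the time/kali example, not the actual resources:

```python
# Illustrative sketch of bilingual word mapping: the union of
# omega_I(c_I) over every Indonesian concept c_I that is E-equivalent
# to some concept conveyed by the English word. Toy data only.

chi_E = {"time": {"cE1", "cE2", "cE3"}}               # English word -> concepts
E = {("cE1", "cI1"), ("cE2", "cI2"), ("cE3", "cI3")}  # concept equivalences
omega_I = {                                           # Indonesian concept -> words
    "cI1": {"kali"},
    "cI2": {"waktu", "saat"},
    "cI3": {"jam"},
}

def word_mapping(w_E):
    """Return all plausible Indonesian translations of w_E."""
    translations = set()
    for c_E in chi_E.get(w_E, set()):
        for ce, c_I in E:
            if ce == c_E:
                translations |= omega_I.get(c_I, set())
    return translations

print(sorted(word_mapping("time")))  # ['jam', 'kali', 'saat', 'waktu']
```

This makes the two-step structure of the task explicit: first collect the concepts of the source word, then collect every target word conveying an equivalent concept.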
3 Automatic mapping using Latent Semantic Analysis

Latent Semantic Analysis, or simply LSA, is a method to discern underlying semantic information from a given corpus of text, and to subsequently represent the contextual meaning of the words in the corpus as vectors in a high-dimensional semantic space (Landauer et al., 1998). As such, LSA is a powerful method for word sense disambiguation.

The mathematical foundation of LSA is provided by the Singular Value Decomposition, or SVD. Initially, a corpus is represented as an m × n word-passage matrix M, where cell M_ij represents the occurrence of the i-th word in the j-th passage. Thus, each row of M represents a word and each column represents a passage. The SVD is then applied to M, decomposing it such that M = UΣVᵀ, where U is an m × m matrix of left singular vectors, V is an n × n matrix of right singular vectors, and Σ is an m × n matrix containing the singular values of M.

Crucially, this decomposition factors M using an orthonormal basis that produces an optimal reduced-rank approximation matrix (Kalman, 1996). By reducing the dimensions of the matrix, irrelevant information and noise are removed. The optimal rank reduction yields useful induction of implicit relations. However, finding the optimal level of rank reduction is an empirical issue.
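As a sketch of this decomposition-and-truncation step, the following uses NumPy's SVD routine on a tiny invented word-passage matrix; the matrix and the choice k = 2 are purely illustrative:

```python
import numpy as np

# Illustrative sketch of SVD-based rank reduction on a toy
# word-passage matrix M (4 words x 3 passages).
M = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])

# Thin SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2  # number of retained dimensions; the optimal k is an empirical issue
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k approximation of M
```

Because the singular values in s are sorted in decreasing order, keeping the first k of them retains the directions of greatest variance and discards the rest as noise.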
LSA can be applied to exploit a parallel corpus to automatically perform bilingual word and concept mapping. We define a parallel corpus P as a set of pairs ⟨d_E, d_I⟩, where d_E is a document written in the language of WN_E, and d_I is its translation in the language of WN_I.

Intuitively, we would expect that if two words w_E and w_I consistently occur in documents that are translations of each other, but not in other documents, then they would at the very least be semantically related, and possibly even be translations of each other. For instance, imagine a parallel corpus consisting of news articles written in English and Indonesian: in English articles where the word Japan occurs, we would expect the word Jepang to occur in the corresponding Indonesian articles.

This intuition can be represented in a word-document matrix as follows: let M_E be a word-document matrix of n English documents and p English words, and M_I be a word-document matrix of n Indonesian documents and q Indonesian words. The documents are arranged such that, for 1 ≤ i ≤ n, the English document represented by column i of M_E and the Indonesian document represented by column i of M_I form a pair of translations. Since they are translations, we can view them as occupying exactly the same point in semantic space, and could just as easily view column i of both matrices as representing the union, or concatenation, of the two articles.
Consequently, we can construct the bilingual word-document matrix

M = [ M_E ]
    [ M_I ]

which is a (p + q) × n matrix where cell M_ij contains the number of occurrences of word i in article j. Row i forms the semantic vector of, for i ≤ p, an English word, and for p < i ≤ p + q, an Indonesian word. Conversely, column j forms a vector representing the English and Indonesian words appearing in the pair of translations of document j.
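The construction of this stacked matrix can be sketched as follows, on an invented two-pair corpus (the documents and the helper function are ours, for exposition only):

```python
import numpy as np

# Illustrative sketch of building the bilingual word-document matrix M
# by stacking the English matrix M_E (p x n) on top of the Indonesian
# matrix M_I (q x n); column j of both matrices describes the same
# translation pair. The toy documents are invented.

en_docs = ["japan earthquake news", "market news today"]
id_docs = ["berita gempa jepang", "berita pasar hari ini"]

def word_doc_matrix(docs):
    """Raw-count word-document matrix over a list of documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    M = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            M[vocab.index(w), j] += 1
    return vocab, M

en_vocab, M_E = word_doc_matrix(en_docs)  # p x n
id_vocab, M_I = word_doc_matrix(id_docs)  # q x n
M = np.vstack([M_E, M_I])                 # (p + q) x n bilingual matrix
```

Stacking vertically rather than horizontally is the key design choice: it is what lets an English row vector and an Indonesian row vector live in the same document space and thus be compared directly.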
This approach is similar to that of (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007). The SVD process is the same, while the usage is different. For example, Rehder et al. (1997) employ SVD for cross-language information retrieval. On the other hand, we use it to accomplish word and concept mappings.

LSA can be applied to this bilingual word-document matrix. Computing the SVD of this matrix and reducing the rank should unearth implicit patterns of semantic concepts. The vectors representing English and Indonesian words that are closely related should have high similarity; word translations more so.
To approximate the bilingual word mapping task, we compare the similarity between the semantic vectors representing words in W_E and W_I. Specifically, for each of the first p rows in M, which represent words in W_E, we compute their similarity to each of the last q rows, which represent words in W_I. Given a large enough corpus, we would expect all words in W_E and W_I to be represented by rows in M.
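This row-by-row comparison can be sketched with cosine similarity on an invented matrix in which one English row and one Indonesian row have identical co-occurrence patterns (the paper does not fix a similarity measure here; cosine is a common choice in LSA work):

```python
import numpy as np

# Illustrative sketch of the word-mapping comparison: similarity of
# each of the first p rows of M (English words) to each of the last
# q rows (Indonesian words). Toy data only.

p, q = 2, 2
M = np.array([[2., 0., 1.],   # English word rows (0 .. p-1)
              [0., 3., 0.],
              [2., 0., 1.],   # Indonesian word rows (p .. p+q-1)
              [0., 1., 0.]])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# sim[i][j]: similarity of English word i to Indonesian word j
sim = np.array([[cosine(M[i], M[p + j]) for j in range(q)]
                for i in range(p)])
```

In this toy example the first English word and the first Indonesian word occur in exactly the same documents, so their similarity is maximal, which is precisely the signal the mapping task relies on.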
To approximate the bilingual concept mapping task, we compare the similarity between the semantic vectors representing concepts in C_E and C_I. These vectors can be approximated by first constructing a set of textual contexts representing a concept c. For example, we can include the words in ω(c) together with the words from its gloss and example sentences. The semantic vector of a concept is then a weighted average of the semantic vectors of the words contained within this context set, i.e. rows in M. Again, given a large enough corpus, we would expect enough of these context words to be represented by rows in M to form an adequate semantic vector for the concept c.
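This averaging step can be sketched as follows, with invented row vectors and uniform weights (the paper leaves the exact weighting scheme open, so equal weights are our simplifying assumption):

```python
import numpy as np

# Illustrative sketch: a concept's semantic vector approximated as a
# (here uniformly) weighted average of the row vectors of its context
# words -- synset members plus gloss and example words. Toy data only.

rows = {                         # word -> its row vector in M
    "time":     np.array([1., 0., 2.]),
    "moment":   np.array([1., 1., 1.]),
    "suitable": np.array([0., 1., 1.]),
}

context = ["time", "moment", "suitable"]   # context set of one concept
weights = np.full(len(context), 1.0 / len(context))

concept_vec = sum(wt * rows[w] for wt, w in zip(weights, context))
```

Concept vectors for English and Indonesian concepts built this way live in the same document space as the word vectors, so the same similarity comparison used for word mapping applies unchanged.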
4 Experiments

4.1 Existing Resources

For the English lexicon, we used the most current version of WordNet (Fellbaum, 1998), version 3.0³. For each of the 117659 distinct synsets, we only use the following data: the set of words belonging to the synset, the gloss, and example sentences, if any. The union of these resources yields a set of 169583 unique words.

³ More specifically, the SQL version available from http://wnsqlbuilder.sourceforge.net
For the Indonesian lexicon, we used an electronic version of the KBBI developed at the University of Indonesia. For each of the 85521 distinct word sense definitions, we use the following data: the list of sublemmas, i.e. inflected forms, along with the gloss and example sentences, if any. The union of these resources yields a set of 87171 unique words.
Our main parallel corpus consists of 3273 English and Indonesian article pairs taken from the ANTARA news agency. This collection was developed by Mirna Adriani and Monica Lestari Paramita at the Information Retrieval Lab, University of Indonesia⁴.

A bilingual English-Indonesian dictionary was constructed using various online resources, including a handcrafted dictionary by Hantarto Widjaja⁵, kamus.net, and Transtool v6.1, a commercial translation system. In total, this dictionary maps 37678 unique English words to 60564 unique Indonesian words.
4.2 Bilingual Word Mapping

Our experiment with bilingual word mapping was set up as follows: firstly, we define a collection of article pairs derived from the ANTARA collection, and from it we set up a bilingual word-document matrix (see Section 3). The LSA process is subsequently applied to this matrix, i.e. we first compute the SVD of this matrix, and then use it to compute the optimal k-rank approximation. Finally, based on this approximation, for a randomly chosen set of vectors representing English words, we compute the nearest vectors representing the most similar Indonesian words. This is conventionally computed using the cosine of the angle between two vectors.
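The pipeline just described (SVD, rank-k truncation, cosine ranking of Indonesian word vectors against each English word vector) could be sketched as below, assuming a dense NumPy word-document matrix with one row per English or Indonesian word; all names here are illustrative, not from the paper.

```python
import numpy as np

def lsa_map(X, k, english_rows, indonesian_rows, top_n=10):
    """X: bilingual word-document matrix (rows are words, both
    languages stacked). Compute the optimal rank-k approximation via
    SVD, then rank Indonesian rows by cosine similarity to each
    English row."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

    mappings = {}
    for e in english_rows:
        sims = [(i, cos(Xk[e], Xk[i])) for i in indonesian_rows]
        sims.sort(key=lambda t: -t[1])
        mappings[e] = sims[:top_n]  # most similar Indonesian words
    return mappings
```

Comparing rows of the reconstructed matrix is equivalent (for cosine) to comparing rows of U_k S_k, since V_k has orthonormal columns.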
Within this general framework, <str<strong>on</strong>g>the</str<strong>on</strong>g>re are several<br />
variables that we experiment with, as follows:<br />
• Collection size. Three subsets of the parallel corpus were randomly created: P100 contains 100 article pairs, P500 contains 500 article pairs, and P1000 contains 1000 article pairs. Each subsequent subset wholly contains the previous subsets, i.e. P100 ⊂ P500 ⊂ P1000.

⁴ publication forthcoming
⁵ http://hantarto.definitionroadsafety.org
• Rank reduction. For each collection, we applied LSA with different degrees of rank approximation, namely 10%, 25%, and 50% of the number of dimensions of the original collection. Thus, for P100 we compute the 10, 25, and 50-rank approximations, for P500 we compute the 50, 125, and 250-rank approximations, and for P1000 we compute the 100, 250, and 500-rank approximations.
• Removal of stopwords. Stopwords are words that occur very frequently in a text and are thus assumed to be insignificant in representing its specific context. Removing them is a common technique used to improve the performance of information retrieval systems. It is applied in preprocessing the collections, i.e. all instances of the stopwords are removed from the collections before applying LSA.
• Weighting. Two weighting schemes, namely TF-IDF and Log-Entropy, were applied to a word-document matrix separately.
• Mapping Selection. For computing the precision and recall values, we experimented with the number of mapping results to consider: the top 1, 10, 50, and 100 mappings based on similarity were taken.
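The two weighting schemes can be sketched with one common formulation of each. The paper does not give its exact formulas, so the variants below (log of inverse document frequency; log term frequency scaled by one minus normalized entropy) are assumptions.

```python
import numpy as np

def tf_idf(X):
    """TF-IDF, one common form: raw term frequency times
    log(n_docs / document_frequency). X is words x documents."""
    n_docs = X.shape[1]
    df = np.count_nonzero(X, axis=1)          # docs containing each word
    idf = np.log(n_docs / np.maximum(df, 1))
    return X * idf[:, None]

def log_entropy(X):
    """Log-Entropy: log(tf + 1) scaled by a global weight
    1 + sum_j p_ij log p_ij / log(n_docs), where p_ij is the word's
    proportion of occurrences in document j."""
    n_docs = X.shape[1]
    totals = X.sum(axis=1, keepdims=True)
    p = np.divide(X, totals, out=np.zeros_like(X, dtype=float),
                  where=totals > 0)
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log(X + 1.0) * global_w[:, None]
```

Under these formulations a word spread evenly over all documents gets weight zero in both schemes, which matches the intuition that such words carry little discriminative information.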
(a)                          (b)
film            0.814        pembebanan      0.973
filmnya         0.698        kijang          0.973
sutradara       0.684        halmahera       0.973
garapan         0.581        alumina         0.973
perfilman       0.554        terjadwal       0.973
penayangan      0.544        viskositas      0.973
kontroversial   0.526        tabel           0.973
koboi           0.482        royalti         0.973
irasional       0.482        reklamasi       0.973
frase           0.482        penyimpan       0.973

Table 2. The 10 Most Similar Indonesian Words for the English Words (a) film and (b) billion
As an example, Table 2 presents the results of mapping the two English words film and billion, respectively, to their Indonesian translations, using the P1000 training collection with 500-rank approximation. No weighting was applied. The former shows a successful mapping, while the latter shows an unsuccessful one. Bilingual LSA correctly maps film to its translation, film, despite the fact that they are treated as separate elements, i.e. their shared orthography is completely coincidental. Additionally, the other Indonesian words it suggests are semantically related, e.g. sutradara (director), garapan (creation), penayangan (screening), etc. On the other hand, the suggested word mappings for billion are incorrect, and the correct translation, milyar, is missing.
We suspect this may be due to several factors. Firstly, billion does not by itself invoke a particular semantic frame, and thus its semantic vector might not suggest a specific conceptual domain. Secondly, billion can sometimes be translated numerically instead of lexically. Lastly, this failure may also be due to the lack of data: the collection is simply too small to provide useful statistics that represent semantic context. Similar LSA approaches are commonly trained on collections of text numbering in the tens of thousands of articles.
Note as well that the absolute vector cosine values do not accurately reflect the correctness of the word translations. To properly assess the results of this experiment, evaluation against a gold standard is necessary. This is achieved by comparing its precision and recall against the Indonesian words returned by the bilingual dictionary, i.e. how isomorphic is the set of LSA-derived word mappings with a human-authored set of word mappings?

We provide a baseline as comparison, which computes the nearness between English and Indonesian words on the original word-document occurrence frequency matrix. Other approaches are possible, e.g. mutual information (Sari, 2007).
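Evaluation against the dictionary gold standard can be sketched as micro-averaged precision and recall over the selected mappings. The helper below is hypothetical; the paper does not specify its exact averaging.

```python
def precision_recall(predicted, gold):
    """predicted: {english_word: list of suggested Indonesian words};
    gold: {english_word: set of dictionary translations}.
    Precision = correct suggestions / all suggestions;
    recall = correct suggestions / all gold translations."""
    tp = sum(len(set(preds) & gold.get(w, set()))
             for w, preds in predicted.items())
    n_pred = sum(len(preds) for preds in predicted.values())
    n_gold = sum(len(gold.get(w, set())) for w in predicted)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall
```

Taking more top-n suggestions per word grows the denominator of precision while giving recall more chances to hit a gold translation, which is the trade-off Table 3(e) exhibits.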
Table 3(a)-(e) shows the different aspects of our experimental results, obtained by averaging over the other variables. Table 3(a) confirms our intuition that as the collection size increases, the precision and recall values also increase. Table 3(b) presents the effects of rank approximation. It shows that the higher the rank approximation percentage, the better the mapping results. Note that a rank approximation of 100% is equal to the FREQ baseline of simply using the full-rank word-document matrix for computing vector space nearness. Table 3(c) suggests that stopwords seem to help LSA yield the correct mappings. Stopwords are generally believed not to be bounded by semantic domains, and thus not to carry any semantic bias. However, given the small size of the collection, stopwords that happen to appear consistently in a specific domain may coincidentally carry some semantic information about that domain. Table 3(d) compares the mapping results in terms of weighting usage. It suggests that weighting can improve the mappings. Additionally, Log-Entropy weighting yields the highest results. Table 3(e) shows the comparison of mapping selections. As the number of translation pairs selected increases, the precision value decreases. On the other hand, as the number of translation pairs selected increases, the possibility of finding more pairs matching those in the bilingual dictionary increases. Thus, the recall value increases as well.
Collection Size      FREQ             LSA
                     P       R        P       R
P100                 0.0668  0.1840   0.0346  0.1053
P500                 0.1301  0.2761   0.0974  0.2368
P1000                0.1467  0.2857   0.1172  0.2603
(a)

Rank Approximation   P       R
10%                  0.0680  0.1727
25%                  0.0845  0.2070
50%                  0.0967  0.2226
100%                 0.1009  0.2285
(b)

Stopwords            FREQ             LSA
                     P       R        P       R
Contained            0.1108  0.2465   0.0840  0.2051
Removed              0.1138  0.2440   0.0822  0.1964
(c)

Weighting Usage      FREQ             LSA
                     P       R        P       R
No Weighting         0.1009  0.2285   0.0757  0.1948
Log-Entropy          0.1347  0.2753   0.1041  0.2274
TF-IDF               0.1013  0.2319   0.0694  0.1802
(d)

Mapping              FREQ             LSA
Selection            P       R        P       R
Top 1                0.3758  0.1588   0.2380  0.0987
Top 10               0.0567  0.2263   0.0434  0.1733
Top 50               0.0163  0.2911   0.0133  0.2338
Top 100              0.0094  0.3183   0.0081  0.2732
(e)

Table 3. Results of bilingual word mapping comparing (a) collection size, (b) rank approximation, (c) removal of stopwords, (d) weighting schemes, and (e) mapping selection
Most interesting, however, is the fact that the FREQ baseline, which uses the basic vector space model, consistently outperforms LSA.
4.3 Bilingual Concept Mapping

Using the same resources from the previous experiment, we ran an experiment to perform bilingual concept mapping by replacing the vectors to be compared with semantic vectors for concepts (see Section 3). For a concept given by a WordNet synset, we constructed a textual context set as the union of the set of words belonging to the synset, the set of words in its gloss, and the set of words in its example sentences. To represent our intuition that the words in the synset play a more important role in defining the semantic vector than the words in the gloss and example, we applied weights of 60%, 30%, and 10% to the three components, respectively. Similarly, a semantic vector representing a concept in the Indonesian lexicon, i.e. a word sense in the KBBI, was constructed from a textual context set composed of the sublemma, the definition, and the example of the word sense, using the same weightings. We only average word vectors if they appear in the collection (depending on the experimental variables used).
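The 60/30/10 weighting over synset words, gloss, and example could look like the sketch below. Renormalizing over only the components actually present in the collection is my assumption, as are all the names.

```python
import numpy as np

def weighted_concept_vector(synset_words, gloss_words, example_words,
                            M, vocab_index, weights=(0.6, 0.3, 0.1)):
    """Weighted average of three context components (synset words,
    gloss, examples). Words absent from the corpus vocabulary are
    dropped; weights are renormalized over the components that
    contributed at least one word vector."""
    parts = []
    components = (synset_words, gloss_words, example_words)
    for words, w in zip(components, weights):
        rows = [M[vocab_index[t]] for t in words if t in vocab_index]
        if rows:
            parts.append((np.mean(rows, axis=0), w))
    if not parts:
        return None  # nothing in this context set occurs in the corpus
    total = sum(w for _, w in parts)
    return sum(v * w for v, w in parts) / total
```

The failure case in Table 4(b), where almost no synset word occurs in the training collection, corresponds to a vector built only from weak gloss/example components (or `None`).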
We formulated an experiment which closely resembles the word sense disambiguation problem: given a WordNet synset, the task is to select the most appropriate Indonesian sense from a subset of senses that have been selected based on their words appearing in our bilingual dictionary. These specific senses are called suggestions. Thus, instead of comparing the vector representing the synset communication with every single Indonesian sense in the KBBI, in this task we only compare it against suggestions with a limited range of sublemmas, e.g. komunikasi, perhubungan, hubungan, etc. This setup is thus identical to that of an ongoing experiment here to manually map WordNet synsets to KBBI senses. Consequently, this facilitates assessment of the results by computing the level of agreement between the LSA-based mappings and human annotations.
To illustrate, Tables 4(a) and 4(b) present a successful and an unsuccessful example of mapping a WordNet synset. For each example we show the synset ID and the ideal textual context set, i.e. the set of words that convey the synset, its gloss, and example sentences. We then show the actual textual context set with the notation {{X}, {Y}, {Z}}, where X, Y, and Z are the subsets of words that appear in the training collection. We then show the Indonesian word sense deemed to be most similar. For each sense we show the vector similarity score, the KBBI ID, and its ideal textual context set, i.e. the sublemma, its definition, and example sentences. We then show the actual textual context set with the same notation as above.
WordNet Synset ID: 100319939, Words: chase, following, pursual, pursuit, Gloss: the act of pursuing in an effort to overtake or capture, Example: the culprit started to run and the cop took off in pursuit, Textual context set: {{following, chase}, {the, effort, of, to, or, capture, in, act, pursuing, an}, {the, off, took, to, run, in, culprit, started, and}}

KBBI ID: k39607 - Similarity: 0.804, Sublemma: mengejar, Definition: berlari untuk menyusul menangkap dsb memburu, Example: ia berusaha mengejar dan menangkap saya, Textual context set: {{mengejar}, {memburu, berlari, menangkap, untuk, menyusul}, {berusaha, dan, ia, mengejar, saya, menangkap}}

(a)
WordNet Synset ID: 201277784, Words: crease, furrow, wrinkle, Gloss: make wrinkled or creased, Example: furrow one's brow, Textual context set: {{}, {or, make}, {s, one}}

KBBI ID: k02421 - Similarity: 0.69, Sublemma: alur, Definition: jalinan peristiwa dl karya sastra untuk mencapai efek tertentu pautannya dapat diwujudkan oleh hubungan temporal atau waktu dan oleh hubungan kausal atau sebab-akibat, Example: (none), Textual context set: {{alur}, {oleh, dan, atau, jalinan, peristiwa, diwujudkan, efek, dapat, karya, hubungan, waktu, mencapai, untuk, tertentu}, {}}

(b)

Table 4. Example of (a) Successful and (b) Unsuccessful Concept Mappings
In the first example, the textual context sets from both the WordNet synset and the KBBI senses are fairly large, and provide sufficient context for LSA to choose the correct KBBI sense. However, in the second example, the textual context set for the synset is very small, due to the words not appearing in the training collection. Furthermore, it does not contain any of the words that truly convey the concept. As a result, LSA is unable to identify the correct KBBI sense.
For this experiment, we used the P 1000 training
collection. The results are presented in Table 5. As
a baseline, we select three random suggested Indonesian
word senses as a mapping for an English
word sense. The reported random baseline in Table
5 is an average of 10 separate runs. Another baseline
was computed by comparing English common-based
concepts to their suggestions based on a
full-rank word-document matrix. The top 3 Indonesian
concepts with the highest similarity values are designated
as the mapping results. Subsequently, we
compute the Fleiss kappa (Fleiss, 1971) of this result
together with the human judgements.
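Fleiss' kappa admits a short direct implementation: each item's observed inter-rater agreement is compared against the chance agreement implied by the overall category proportions. A minimal sketch, assuming the judgements are tabulated as a per-item matrix of category counts (the input format is our assumption, not a description of the actual judgement data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa (Fleiss, 1971).

    ratings[i][j] = number of raters who assigned item i to category j;
    every item must have been rated by the same number of raters.
    """
    N = len(ratings)            # number of items
    n = sum(ratings[0])         # raters per item
    k = len(ratings[0])         # number of categories
    total = N * n

    # Chance agreement P_e: sum of squared overall category proportions.
    p_j = [sum(row[j] for row in ratings) / total for j in range(k)]
    P_e = sum(p * p for p in p_j)

    # Mean observed per-item agreement P_bar.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 1 indicates perfect agreement; values near 0 indicate agreement at chance level, which is the scale on which the figures in Table 5 should be read.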
The average level of agreement between the
LSA 10% mappings and the human judges
(0.2713) is not as high as that between the human
judges themselves (0.4831). Nevertheless, in general
it is better than the random baseline (0.2380)
and the frequency baseline (0.2132), which suggests
that LSA is indeed managing to capture some
measure of bilingual semantic information implicit
within the parallel corpus.
Furthermore, LSA mapping with a 10% rank approximation
yields higher levels of agreement than
LSA with other rank approximations. This contradicts
the word mapping results, where LSA
with larger rank approximations yields higher results
(Section 4.2).
5 Discussion
Previous work has shown LSA to contribute
positive gains on similar tasks such as Cross-Language
Information Retrieval (Rehder et al., 1997).
However, the bilingual word mapping results presented
in Section 4.3 show the basic vector space
model consistently outperforming LSA at that particular
task, despite our initial intuition that LSA
should actually improve precision and recall.
We speculate that the task of bilingual word
mapping may be even harder for LSA than that of
Fleiss Kappa Values

Judges   Synsets  Judges only  Judges +  Judges +    Judges +      Judges +      Judges +
                               RNDM3     FREQ Top3   LSA 10% Top3  LSA 25% Top3  LSA 50% Top3
≥ 2      144      0.4269       0.1318    0.1667      0.1544        0.1606        0.1620
≥ 3       24      0.4651       0.2197    0.2282      0.2334        0.2239        0.2185
≥ 4        8      0.5765       0.3103    0.2282      0.3615        0.3329        0.3329
≥ 5        4      0.4639       0.2900    0.2297      0.3359        0.3359        0.3359
Average           0.4831       0.2380    0.2132      0.2713        0.2633        0.2623

Table 5. Results of Concept Mapping
bilingual concept mapping due to its finer alignment
granularity. While concept mapping attempts
to map a concept conveyed by a group of semantically
related words, word mapping attempts to map
a word with a specific meaning to its translation in
another language.
In theory, LSA employs rank reduction to remove
noise and to reveal underlying information
contained in a corpus. LSA has a ‘smoothing’ effect
on the matrix, which is useful for discovering general
patterns, e.g. clustering documents by semantic
domain. Our experimental results, however, generally
show the frequency baseline, which employs
the full-rank word-document matrix, outperforming
LSA.
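The rank reduction at the heart of LSA is a truncated singular value decomposition of the word-document matrix: keeping only the k largest singular values yields the closest rank-k approximation in the least-squares sense. A minimal sketch with NumPy (the matrix contents and the choice of k are illustrative, not taken from our data):

```python
import numpy as np

def rank_k_approximation(X, k):
    """Best rank-k approximation of the word-document matrix X,
    as used by LSA to 'smooth' the raw cooccurrence counts."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

A "10% rank approximation" of an m × n matrix then corresponds to keeping roughly k = 0.1 × min(m, n) singular values.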
We speculate that the rank reduction blurs
some crucial details necessary for word mapping.
The frequency baseline seems to encode
more cooccurrence information than LSA: it compares
word vectors between English and Indonesian that contain
the raw frequencies of word occurrence in each
document. LSA, on the other hand, encodes more
semantic relatedness: it compares English and Indonesian
word vectors containing estimates of
word frequency in documents according to the
contextual meaning. Since the purpose of bilingual
word mapping is to obtain proper translations for
an English word, it may be better explained as an
issue of cooccurrence rather than semantic relatedness.
That is, the higher the rate of cooccurrence
between an English and an Indonesian word, the
likelier they are to be translations of each other.
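The cooccurrence view behind the frequency baseline can be sketched as a cosine comparison of raw per-document frequency vectors: the Indonesian candidate whose distribution over the aligned documents best matches the English word's is proposed as its translation. All words and counts below are hypothetical toy data, not drawn from the actual corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_translations(eng_vec, indo_vectors, k=3):
    """Rank Indonesian candidates by cosine similarity of their
    per-document frequency vectors with the English word's vector."""
    ranked = sorted(indo_vectors.items(),
                    key=lambda item: cosine(eng_vec, item[1]),
                    reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical frequencies of each word over four aligned documents:
river_en = [5, 0, 4, 0]
candidates = {
    "kali":   [4, 3, 5, 0],   # polysemous: 'river' but also 'times'
    "waktu":  [0, 6, 0, 5],   # 'time'
    "sungai": [6, 0, 3, 0],   # 'river'
}
```

On this toy data, sungai ranks above kali and waktu as a translation of river, because its frequency profile across the aligned documents matches river's most closely.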
LSA may yield better results in the case of finding
words with similar semantic domains. Thus,
the LSA mapping results might be better assessed
using a resource listing semantically related terms,
rather than a bilingual dictionary listing
translation pairs. A bilingual dictionary imposes
more specific constraints than semantic relatedness,
as it specifies that the mapping results should
be the translations of an English word.
Furthermore, polysemous terms may pose another
problem for LSA. Through rank approximation,
LSA estimates the occurrence frequency of a word
in a particular document. Since the polysemy of English
terms and Indonesian terms can be quite different,
the estimates for words which are mutual
translations can also differ. For instance, kali and
waktu are Indonesian translations of the English
word time. However, kali is also the Indonesian
translation of the English word river. Suppose
kali and time appear frequently in documents
about multiplication, but kali and river appear
only rarely in documents about rivers. Meanwhile, waktu
and time appear frequently in documents about time.
As a result, LSA may estimate kali with greater
frequency in documents about multiplication and
time, but with lower frequency in documents about
rivers. The word vectors of kali and river
may thus not be similar, and in bilingual word mapping
LSA may not suggest kali as the proper
translation for river. Although polysemous words
can also be a problem for the frequency baseline,
since it merely uses raw word frequency vectors,
the problem does not affect other word vectors. LSA,
on the other hand, exacerbates this problem by taking
it into account when estimating other word frequencies.
6 Summary
We have presented a model for computing bilingual
word and concept mappings between two semantic
lexicons, in our case Princeton WordNet and the
KBBI, using an extension to LSA that exploits implicit
semantic information contained within a parallel
corpus.
The results, whilst far from conclusive, indicate
that for bilingual word mapping, LSA is consistently
outperformed by the basic vector space
model, i.e. where the full-rank word-document matrix
is applied, whereas for bilingual concept mapping
LSA seems to slightly improve results. We
speculate that this is due to the fact that LSA,
whilst very useful for revealing implicit semantic
patterns, blurs the cooccurrence information that is
necessary for establishing word translations.
We suggest that, particularly for bilingual word
mapping, a finer granularity of alignment, e.g. at
the sentential level, may increase accuracy (Deng
and Gao, 2007).
Acknowledgment
The work presented in this paper is supported by
an RUUI (Riset Unggulan Universitas Indonesia)
2007 research grant from DRPM UI (Direktorat
Riset dan Pengabdian Masyarakat Universitas Indonesia).
We would also like to thank Franky for
help in software implementation and Desmond
Darma Putra for help in computing the Fleiss kappa
values in Section 4.3.
References
Eneko Agirre and Philip Edmonds, editors. 2007. Word
Sense Disambiguation: Algorithms and Applications.
Springer.
Katri A. Clodfelder. 2003. An LSA Implementation
Against Parallel Texts in French and English. In
Proceedings of the HLT-NAACL Workshop: Building
and Using Parallel Texts: Data Driven Machine
Translation and Beyond, pages 111–114.
Yonggang Deng and Yuqing Gao. 2007. Guiding
statistical word alignment models with prior knowledge.
In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics, pages
1–8, Prague, Czech Republic. Association for Computational
Linguistics.
Christiane Fellbaum, editor. 1998. WordNet: An
Electronic Lexical Database. MIT Press.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement
among many raters. Psychological Bulletin,
76(5):378–382.
Dan Kalman. 1996. A singularly valuable decomposition:
The SVD of a matrix. The College Mathematics
Journal, 27(1):2–23.
Thomas K. Landauer, Peter W. Foltz, and Darrell Laham.
1998. An introduction to latent semantic analysis.
Discourse Processes, 25:259–284.
Bob Rehder, Michael L. Littman, Susan T. Dumais, and
Thomas K. Landauer. 1997. Automatic 3-language
cross-language information retrieval with latent semantic
indexing. In Proceedings of the Sixth Text
Retrieval Conference (TREC-6), pages 233–239.
Syandra Sari. 2007. Perolehan informasi lintas bahasa
indonesia-inggris berdasarkan korpus paralel dengan
menggunakan metoda mutual information dan
metoda similarity thesaurus [Indonesian-English
cross-language information retrieval based on a
parallel corpus using the mutual information and
similarity thesaurus methods]. Master's thesis, Faculty
of Computer Science, University of Indonesia. Call
number: T-0617.