
Comparing the value of Latent Semantic Analysis on two English-to-Indonesian lexical mapping tasks

Eliza Margaretha
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
elm40@ui.edu

Ruli Manurung
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
maruli@cs.ui.ac.id

Abstract

This paper describes an experiment that attempts to automatically map English words and concepts, derived from the Princeton WordNet, to their Indonesian analogues appearing in a widely-used Indonesian dictionary, using Latent Semantic Analysis (LSA). A bilingual semantic model is derived from an English-Indonesian parallel corpus. Given a particular word or concept, the semantic model is then used to identify its neighbours in a high-dimensional semantic space. Results from various experiments indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied. We speculate that this is because the 'smoothing' effect LSA has on the word-document matrix, whilst very useful for revealing implicit semantic patterns, blurs the co-occurrence information that is necessary for establishing word translations.

1 Overview

An ongoing project at the Information Retrieval Lab, Faculty of Computer Science, University of Indonesia, concerns the development of an Indonesian WordNet1. To that end, one major task concerns the mapping of two monolingual dictionaries at two different levels: bilingual word mapping, which seeks to find translations of a lexical entry from one language to another, and bilingual concept mapping, which defines equivalence classes over concepts defined in two language resources. In other words, we try to automatically construct two variants of a bilingual dictionary between two languages, i.e. one with sense-disambiguated entries and one without.

1 http://bahasa.cs.ui.ac.id/iwn

In this paper we present an extension of LSA to a bilingual context that is similar to (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007), and then apply it to the two mapping tasks described above, specifically towards the lexical resources of Princeton WordNet (Fellbaum, 1998), an English semantic lexicon, and the Kamus Besar Bahasa Indonesia (KBBI)2, considered by many to be the official dictionary of the Indonesian language.

We first provide formal definitions of our two tasks of bilingual word and concept mapping (Section 2) before discussing how these tasks can be automated using LSA (Section 3). We then present our experiment design and results (Sections 4 and 5), followed by an analysis and discussion of the results in Section 6.

2 Task Definitions

As mentioned above, our work concerns the mapping of two monolingual dictionaries. In our work, we refer to these resources as WordNets due to the fact that we view them as semantic lexicons that index entries based on meaning. However, we do not consider other semantic relations typically associated with a WordNet, such as hypernymy, hyponymy, etc. For our purposes, a WordNet can be formally defined as a 4-tuple ⟨C, W, χ, ω⟩ as follows:

2 The KBBI is the copyright of the Language Centre, Indonesian Ministry of National Education.

• A concept c ∈ C is a semantic entity, which represents a distinct, specific meaning. Each concept is associated with a gloss, which is a textual description of its meaning. For example, we could define two concepts, c1 and c2, where the former is associated with the gloss "a financial institution that accepts deposits and channels the money into lending activities" and the latter with "sloping land (especially the slope beside a body of water)".

• A word w ∈ W is an orthographic entity, which represents a word in a particular language (in the case of Princeton WordNet, English). For example, we could define two words, w1 and w2, where the former represents the orthographic string bank and the latter represents spoon.

• A word may convey several different concepts. The function χ : W → 2^C returns all concepts conveyed by a particular word. Thus, χ(w), where w ∈ W, returns C_w ⊆ C, the set of all concepts that can be conveyed by w. Using the examples above, χ(w1) = {c1, c2}.

• Conversely, a concept may be conveyed by several words. The function ω : C → 2^W returns all words that can convey a particular concept. Thus, ω(c), where c ∈ C, returns W_c ⊆ W, the set of all words that convey c. Using the examples above, ω(c1) = {w1}.
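As a concrete (and purely illustrative) sketch, the 4-tuple ⟨C, W, χ, ω⟩ can be modelled with plain Python dictionaries; the identifiers c1, w1, etc. are the toy examples from the text, not entries of any real resource:

```python
# C: concepts, each associated with a gloss (toy data from the bank example)
C = {
    "c1": "a financial institution that accepts deposits and channels "
          "the money into lending activities",
    "c2": "sloping land (especially the slope beside a body of water)",
}

# W: words (orthographic entities)
W = {"w1": "bank", "w2": "spoon"}

# chi: word -> set of concepts it conveys
chi = {"w1": {"c1", "c2"}, "w2": set()}

# omega: concept -> set of words that can convey it
omega = {"c1": {"w1"}, "c2": {"w1"}}

print(chi["w1"])    # concepts conveyed by the string "bank"
print(omega["c1"])  # words conveying the financial-institution sense
```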

We can define different WordNets for different languages, e.g. WN^E = ⟨C^E, W^E, χ^E, ω^E⟩ and WN^I = ⟨C^I, W^I, χ^I, ω^I⟩. We also introduce the notation w_i^E to denote word i in WN^E and c_j^I to denote concept j in WN^I. For the sake of our discussion, we will assume WN^E to be an English WordNet, and WN^I to be an Indonesian WordNet.

If we make the assumption that concepts are language independent, WN^E and WN^I should theoretically share the same set of universal concepts, i.e. C^E = C^I. In practice, however, we may have two WordNets with different conceptual representations, hence the distinction between C^E and C^I. We introduce the relation E ⊆ C^E × C^I to denote the explicit mapping of equivalent concepts in WN^E and WN^I.

We now describe two tasks that can be performed between WN^E and WN^I, namely bilingual concept mapping and bilingual word mapping.

The task of bilingual concept mapping is essentially the establishment of the concept equivalence relation E. For example, given the example concepts in Table 1, bilingual concept mapping seeks to establish E = {(c1^E, c1^I), (c2^E, c2^I), (c3^E, c3^I), ...}.

Concept | Word | Gloss | Example
c1^E | time | an instance or single occasion for some event | "this time he succeeded"
c2^E | time | a suitable moment | "it is time to go"
c3^E | time | a reading of a point in time as given by a clock | "do you know what time it is?"
c1^I | kali | kata untuk menyatakan salah satu waktu terjadinya peristiwa yg merupakan bagian dari rangkaian peristiwa yg pernah dan masih akan terus terjadi (a word signifying a particular instance of an ongoing series of events) | "untuk kali ini ia kena batunya" (this time he suffered for his actions)
c2^I | waktu | saat yg tertentu untuk melakukan sesuatu (a specific time to be doing something) | "waktu makan" (eating time)
c3^I | jam | saat tertentu, pada arloji jarumnya yg pendek menunjuk angka tertentu dan jarum panjang menunjuk angka 12 (the point in time when the short hand of a clock points to a certain hour and the long hand points to 12) | "ia bangun jam lima pagi" (she woke up at five o'clock)
c4^I | kali | kata untuk menyatakan kekerapan tindakan (a word signifying the frequency of an action) | "dalam satu minggu ini, dia sudah empat kali datang ke rumahku" (this past week, she has come to my house four times)
c5^I | kali | sebuah sungai yang kecil (a small river) | "air di kali itu sangat keruh" (the water in that small river is very murky)

Table 1. Sample Concepts in WN^E and WN^I


The task of bilingual word mapping is to find, given word w^E ∈ W^E, the set of all its plausible translations in W^I, regardless of the concepts being conveyed. We can also view this task as computing the union of the set of all words in W^I that convey the set of all concepts conveyed by w^E. Formally, we compute the set W' = ⋃ ω^I(c^I), where (c^E, c^I) ∈ E and c^E ∈ χ^E(w^E).

For example, in Princeton WordNet, given the English orthographic form time, χ^E(time) returns more than 15 different concepts, among others c1^E, c2^E, and c3^E (see Table 1). In Indonesian, assuming the relation E as defined above, the set of words that convey c1^I, i.e. ω^I(c1^I), includes kali (as in "kali ini dia berhasil" = "this time she succeeded"). On the other hand, ω^I(c2^I) may include waktu (as in "ini waktunya untuk pergi" = "it is time to go") and saat (as in "sekarang saatnya menjual saham" = "now is the time to sell shares"), and lastly, ω^I(c3^I) may include jam (as in "apa anda tahu jam berapa sekarang?" = "do you know what time it is now?").

Thus, the bilingual word mapping task seeks to compute, for the English word time, the set of Indonesian words {kali, waktu, saat, jam, ...}. Note that each of these Indonesian words may convey different concepts, e.g. χ^I(kali) may also include the 'small river' concept in Table 1.
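The formal definition above can be sketched as plain set operations. The data here is a toy encoding of the time example (identifiers such as c1E and c1I are illustrative, not drawn from the actual resources):

```python
# chi_E: English word -> concepts it conveys
chi_E = {"time": {"c1E", "c2E", "c3E"}}
# E: concept equivalence relation between the two WordNets
E = {("c1E", "c1I"), ("c2E", "c2I"), ("c3E", "c3I")}
# omega_I: Indonesian concept -> words that convey it
omega_I = {
    "c1I": {"kali"},
    "c2I": {"waktu", "saat"},
    "c3I": {"jam"},
}

def word_mapping(w_E):
    """W' = union of omega_I(c_I) over all c_I equivalent (via E) to
    some concept conveyed by the English word w_E."""
    translations = set()
    for c_E in chi_E.get(w_E, set()):
        for (ce, ci) in E:
            if ce == c_E:
                translations |= omega_I.get(ci, set())
    return translations

print(word_mapping("time"))  # kali, waktu, saat, jam (in some order)
```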

3 Automatic mapping using Latent Semantic Analysis

Latent semantic analysis, or simply LSA, is a method to discern underlying semantic information from a given corpus of text, and to subsequently represent the contextual meaning of the words in the corpus as a vector in a high-dimensional semantic space (Landauer et al., 1998). As such, LSA is a powerful method for word sense disambiguation.
The mathematical foundation of LSA is provided by the Singular Value Decomposition, or SVD. Initially, a corpus is represented as an m × n word-passage matrix M, where cell M(i, j) represents the occurrence of the i-th word in the j-th passage. Thus, each row of M represents a word and each column represents a passage. The SVD is then applied to M, decomposing it such that M = UΣVᵀ, where U is an m × r matrix of left singular vectors, V is an n × r matrix of right singular vectors, and Σ is an r × r diagonal matrix containing the singular values of M.

Crucially, this decomposition factors M using an orthonormal basis that produces an optimal reduced-rank approximation matrix (Kalman, 1996). By reducing the dimensions of the matrix, irrelevant information and noise are removed. The optimal rank reduction yields useful induction of implicit relations. However, finding the optimal level of rank reduction is an empirical issue.
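The decomposition and rank-k truncation can be sketched with numpy; the matrix values and the choice of k below are illustrative, since, as noted, the optimal rank is an empirical question:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))          # toy 6-word x 4-passage matrix

# Compact SVD: M = U diag(s) Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                           # reduced rank (chosen arbitrarily here)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# M_k is the closest rank-k matrix to M in the least-squares sense;
# its rows are the 'smoothed' semantic vectors of the words.
print(np.linalg.matrix_rank(M_k))
```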

LSA can be applied to exploit a parallel corpus to automatically perform bilingual word and concept mapping. We define a parallel corpus P as a set of pairs ⟨d^E, d^I⟩, where d^E is a document written in the language of WN^E, and d^I is its translation in the language of WN^I.

Intuitively, we would expect that if two words w^E and w^I consistently occur in documents that are translations of each other, but not in other documents, then they would at the very least be semantically related, and possibly even be translations of each other. For instance, imagine a parallel corpus consisting of news articles written in English and Indonesian: in English articles where the word Japan occurs, we would expect the word Jepang to occur in the corresponding Indonesian articles.

This intuition can be represented in a word-document matrix as follows: let M_E be a word-document matrix of n English documents and m_E English words, and M_I be a word-document matrix of n Indonesian documents and m_I Indonesian words. The documents are arranged such that, for 1 ≤ j ≤ n, the English document represented by column j of M_E and the Indonesian document represented by column j of M_I form a pair of translations. Since they are translations, we can view them as occupying exactly the same point in semantic space, and could just as easily view column j of both matrices as representing the union, or concatenation, of the two articles.

Consequently, we can construct the bilingual word-document matrix

    M = [ M_E ]
        [ M_I ]

which is an (m_E + m_I) × n matrix where cell M(i, j) contains the number of occurrences of word i in article j. Row i forms the semantic vector of, for i ≤ m_E, an English word, and for m_E < i ≤ m_E + m_I, an Indonesian word. Conversely, column j forms a vector representing the English and Indonesian words appearing in translations of document j.
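The stacking above amounts to a vertical concatenation of the two count matrices, assuming the j-th columns of M_E and M_I are mutual translations. A minimal sketch with toy counts:

```python
import numpy as np

M_E = np.array([[2, 0, 1],   # e.g. counts of "japan" in 3 English articles
                [0, 3, 0]])  # counts of another English word
M_I = np.array([[2, 0, 1],   # counts of "jepang" in the 3 translations
                [1, 1, 1],
                [0, 2, 0]])

# (m_E + m_I) x n bilingual word-document matrix
M = np.vstack([M_E, M_I])
print(M.shape)
# Rows 0..1 are English word vectors, rows 2..4 Indonesian word vectors.
```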

This approach is similar to that of (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007). The SVD process is the same, while the usage is different. For example, Rehder et al. (1997) employ SVD for cross-language information retrieval. On the other hand, we use it to accomplish word and concept mappings.

LSA can be applied to this bilingual word-document matrix. Computing the SVD of this matrix and reducing the rank should unearth implicit patterns of semantic concepts. The vectors representing English and Indonesian words that are closely related should have high similarity; word translations more so.

To approximate the bilingual word mapping task, we compare the similarity between the semantic vectors representing words in W^E and W^I. Specifically, for the first m_E rows in M, which represent words in W^E, we compute their similarity to each of the last m_I rows, which represent words in W^I. Given a large enough corpus, we would expect all words in W^E and W^I to be represented by rows in M.
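Using cosine similarity (a common choice for comparing LSA vectors, though the text does not fix the measure), the row-against-row comparison can be sketched as follows, with toy vectors:

```python
import numpy as np

M = np.array([[2.0, 0.0, 1.0],   # English word, e.g. "japan"
              [0.0, 3.0, 0.0],   # English word, e.g. "rice"
              [2.0, 0.0, 1.0],   # Indonesian word, e.g. "jepang"
              [0.0, 2.0, 1.0]])  # Indonesian word, e.g. "beras"
m_E = 2                          # the first m_E rows are English

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# For each English row, rank the Indonesian rows by similarity.
for i in range(m_E):
    sims = [(cosine(M[i], M[j]), j) for j in range(m_E, M.shape[0])]
    best_sim, best_j = max(sims)
    print(i, best_j, round(best_sim, 3))
```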

To approximate the bilingual concept mapping task, we compare the similarity between the semantic vectors representing concepts in WN^E and WN^I. These vectors can be approximated by first constructing a set of textual context representing a concept c. For example, we can include the words in ω(c) together with the words from its gloss and example sentences. The semantic vector of a concept is then a weighted average of the semantic vectors of the words contained within this context set, i.e. rows in M. Again, given a large enough corpus, we would expect enough of these context words to be represented by rows in M to form an adequate semantic vector for the concept c.
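The weighted-average construction can be sketched as below. The word vectors, the context set, and the uniform weighting are all illustrative assumptions; the text does not specify the weighting scheme:

```python
import numpy as np

word_vectors = {                  # rows of M, keyed by word (toy values)
    "time":   np.array([1.0, 0.0, 2.0]),
    "moment": np.array([0.5, 0.5, 1.5]),
    "go":     np.array([0.0, 1.0, 0.0]),
}

def concept_vector(context_words, weights=None):
    """Weighted average of the vectors of a concept's context words,
    skipping words not represented by any row in M."""
    vecs = [word_vectors[w] for w in context_words if w in word_vectors]
    if weights is None:
        weights = [1.0] * len(vecs)
    return np.average(vecs, axis=0, weights=weights[:len(vecs)])

# Context for a "suitable moment" concept: its word plus gloss/example words.
v = concept_vector(["time", "moment", "go"])
print(v)
```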

4 Experiments

4.1 Existing Resources

For the English lexicon, we used the most current version of WordNet (Fellbaum, 1998), version 3.0 3. For each of the 117659 distinct synsets, we only use the following data: the set of words belonging to the synset, the gloss, and example sentences, if any. The union of these resources yields a set of 169583 unique words.

3 More specifically, the SQL version available from http://wnsqlbuilder.sourceforge.net

For the Indonesian lexicon, we used an electronic version of the KBBI developed at the University of Indonesia. For each of the 85521 distinct word sense definitions, we use the following data: the list of sublemmas, i.e. inflected forms, along with the gloss and example sentences, if any. The union of these resources yields a set of 87171 unique words.

Our main parallel corpus consists of 3273 English and Indonesian article pairs taken from the ANTARA news agency. This collection was developed by Mirna Adriani and Monica Lestari Paramita at the Information Retrieval Lab, University of Indonesia (4).

A bilingual English-Indonesian dictionary was constructed using various online resources, including a handcrafted dictionary by Hantarto Widjaja (5), kamus.net, and Transtool v6.1, a commercial translation system. In total, this dictionary maps 37678 unique English words to 60564 unique Indonesian words.

4.2 Bilingual Word Mapping

Our experiment with bilingual word mapping was set up as follows: firstly, we define a collection of article pairs derived from the ANTARA collection, and from it we set up a bilingual word-document matrix (see Section 3). The LSA process is subsequently applied to this matrix, i.e. we first compute the SVD of this matrix, and then use it to compute the optimal k-rank approximation. Finally, based on this approximation, for a randomly chosen set of vectors representing English words, we compute the nearest vectors representing the most similar Indonesian words. Nearness is conventionally computed as the cosine of the angle between two vectors.
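The pipeline just described — build a bilingual word-document matrix, take a rank-k SVD approximation, and rank Indonesian word vectors by cosine similarity — can be sketched as follows. The tiny vocabulary and the counts in the matrix are invented purely for illustration; a real run would use the ANTARA collection.

```python
import numpy as np

# Toy bilingual word-document matrix: rows are words (English rows first,
# then Indonesian rows), columns are aligned article pairs. All counts
# below are invented for illustration only.
words = ["film_en", "economy_en", "movie_en", "film_id", "ekonomi"]
M = np.array([
    [3.0, 0.0, 2.0],   # film (English)
    [0.0, 4.0, 0.0],   # economy (English)
    [1.0, 0.0, 1.0],   # movie (English)
    [2.0, 0.0, 3.0],   # film (Indonesian)
    [0.0, 5.0, 0.0],   # ekonomi (Indonesian)
])

# LSA: SVD of M, then the optimal rank-k approximation.
k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def cosine(u, v):
    """Cosine of the angle between two row vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_indonesian(row_idx):
    """Return the Indonesian word (rows 3 onward) whose reduced-rank
    vector is nearest to the given English word's vector."""
    sims = {w: cosine(M_k[row_idx], M_k[i])
            for i, w in enumerate(words[3:], start=3)}
    return max(sims, key=sims.get)
```

With this toy data, the English "film" row maps to the Indonesian "film" row, since both occur in the same article pair columns.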

Within this general framework, there are several variables that we experiment with, as follows:

• Collection size. Three subsets of the parallel corpus were randomly created: P100 contains 100 article pairs, P500 contains 500 article pairs, and P1000 contains 1000 article pairs. Each subsequent subset wholly contains the previous subsets, i.e. P100 ⊂ P500 ⊂ P1000.

4 Publication forthcoming.

5 http://hantarto.definitionroadsafety.org


• Rank reduction. For each collection, we applied LSA with different degrees of rank approximation, namely 10%, 25%, and 50% of the number of dimensions of the original collection. Thus, for P100 we compute the 10, 25, and 50-rank approximations, for P500 we compute the 50, 125, and 250-rank approximations, and for P1000 we compute the 100, 250, and 500-rank approximations.

• Removal of stopwords. Stopwords are words that appear so frequently in a text that they are assumed to be insignificant in representing its specific context; removing them is a common technique for improving the performance of information retrieval systems. Removal is applied when preprocessing the collections, i.e. all instances of the stopwords are removed from the collections before applying LSA.

• Weighting. Two weighting schemes, namely TF-IDF and Log-Entropy, were applied separately to the word-document matrix.

• Mapping selection. For computing the precision and recall values, we experimented with the number of mapping results to consider: the top 1, 10, 50, and 100 mappings by similarity were taken.
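The Log-Entropy scheme named above can be sketched as follows. Since the paper does not spell out its exact variant, this uses one common formulation: a local weight of log(1 + tf) and a global weight of 1 plus the (negative) normalized entropy of the word's distribution over documents, so that a word spread evenly over all documents gets weight near 0 and a word concentrated in one document gets weight near 1.

```python
import math

def log_entropy(matrix):
    """Apply Log-Entropy weighting to a word-by-document count matrix
    (a list of rows, one per word; assumes at least two documents).

    local(i, j)  = log(1 + tf_ij)
    global(i)    = 1 + sum_j p_ij * log(p_ij) / log(n_docs),
    where p_ij = tf_ij / gf_i and gf_i is the word's total frequency.
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        gf = sum(row)  # global frequency of this word
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p) / math.log(n_docs)
        g = 1.0 + entropy  # entropy term is <= 0, so 0 <= g <= 1
        weighted.append([g * math.log(1.0 + tf) for tf in row])
    return weighted
```

For example, a word occurring uniformly in every document receives a global weight of 0 (it carries no discriminating information), while a word confined to a single document keeps its full local weight.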

(a) film                 (b) billion
film           0.814     pembebanan    0.973
filmnya        0.698     kijang        0.973
sutradara      0.684     halmahera     0.973
garapan        0.581     alumina       0.973
perfilman      0.554     terjadwal     0.973
penayangan     0.544     viskositas    0.973
kontroversial  0.526     tabel         0.973
koboi          0.482     royalti       0.973
irasional      0.482     reklamasi     0.973
frase          0.482     penyimpan     0.973

Table 2. The 10 Most Similar Indonesian Words for the English Words (a) Film and (b) Billion

As an example, Table 2 presents the results of mapping the two English words film and billion, respectively, to their Indonesian translations, using the P1000 training collection with 500-rank approximation. No weighting was applied. The former shows a successful mapping, while the latter shows an unsuccessful one.

Bilingual LSA correctly maps film to its translation, film, despite the fact that the English and Indonesian words are treated as separate elements, i.e. their shared orthography is completely coincidental. Additionally, the other Indonesian words it suggests are semantically related, e.g. sutradara (director), garapan (creation), penayangan (screening), etc. On the other hand, the suggested word mappings for billion are incorrect, and the correct translation, milyar, is missing.

We suspect this may be due to several factors. Firstly, billion does not by itself invoke a particular semantic frame, and thus its semantic vector might not suggest a specific conceptual domain. Secondly, billion can sometimes be translated numerically instead of lexically. Lastly, this failure may also be due to a lack of data: the collection is simply too small to provide useful statistics representing semantic context. Similar LSA approaches are commonly trained on collections numbering in the tens of thousands of articles.

Note as well that the absolute vector cosine values do not accurately reflect the correctness of the word translations. To properly assess the results of this experiment, evaluation against a gold standard is necessary. This is achieved by comparing precision and recall against the Indonesian words returned by the bilingual dictionary, i.e. how isomorphic is the set of LSA-derived word mappings with a human-authored set of word mappings?
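This evaluation can be sketched as follows: for one English word, compare the set of proposed Indonesian translations against the dictionary entry. The word lists in the example are hypothetical, invented only to illustrate the computation.

```python
def precision_recall(proposed, gold):
    """Precision and recall of a set of proposed Indonesian translations
    for one English word, measured against the gold-standard set of
    translations from the bilingual dictionary."""
    if not proposed or not gold:
        return 0.0, 0.0
    hits = len(proposed & gold)
    return hits / len(proposed), hits / len(gold)

# Hypothetical example: top-3 LSA mappings versus a dictionary entry.
p, r = precision_recall({"film", "sutradara", "garapan"}, {"film", "filem"})
# One of the three proposals is in the dictionary (precision 1/3),
# and one of the two dictionary translations was found (recall 1/2).
```

Averaging these per-word scores over the randomly chosen English test words gives the values reported in Table 3.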

We provide a baseline for comparison, which computes the nearness between English and Indonesian words on the original word-document occurrence frequency matrix. Other approaches are possible, e.g. mutual information (Sari, 2007).

Table 3(a)-(e) shows the different aspects of our experimental results, averaging over the other variables. Table 3(a) confirms our intuition that as the collection size increases, the precision and recall values also increase. Table 3(b) presents the effects of rank approximation: the higher the rank approximation percentage, the better the mapping results. Note that a rank approximation of 100% is equal to the FREQ baseline of simply using the full-rank word-document matrix for computing vector space nearness. Table 3(c) suggests that stopwords seem to help LSA yield the correct mappings. Stopwords are generally believed not to be bounded by semantic domains, and thus not to carry any semantic bias. However, given the small size of the collection, stopwords that happen to appear consistently in a specific domain may coincidentally carry some semantic information about that domain. Table 3(d) compares the mapping results in terms of weighting usage: weighting can improve the mappings, and Log-Entropy weighting yields the highest results. Table 3(e) compares mapping selections. As the number of translation pairs selected increases, the precision value decreases; on the other hand, the possibility of finding more pairs matching those in the bilingual dictionary increases, and thus the recall value increases as well.

Collection Size   FREQ             LSA
                  P       R        P       R
P100              0.0668  0.1840   0.0346  0.1053
P500              0.1301  0.2761   0.0974  0.2368
P1000             0.1467  0.2857   0.1172  0.2603
(a)

Rank Approximation   P       R
10%                  0.0680  0.1727
25%                  0.0845  0.2070
50%                  0.0967  0.2226
100%                 0.1009  0.2285
(b)

Stopwords   FREQ             LSA
            P       R        P       R
Contained   0.1108  0.2465   0.0840  0.2051
Removed     0.1138  0.2440   0.0822  0.1964
(c)

Weighting Usage   FREQ             LSA
                  P       R        P       R
No Weighting      0.1009  0.2285   0.0757  0.1948
Log-Entropy       0.1347  0.2753   0.1041  0.2274
TF-IDF            0.1013  0.2319   0.0694  0.1802
(d)

Mapping     FREQ             LSA
Selection   P       R        P       R
Top 1       0.3758  0.1588   0.2380  0.0987
Top 10      0.0567  0.2263   0.0434  0.1733
Top 50      0.0163  0.2911   0.0133  0.2338
Top 100     0.0094  0.3183   0.0081  0.2732
(e)

Table 3. Results of bilingual word mapping comparing (a) collection size, (b) rank approximation, (c) removal of stopwords, (d) weighting schemes, and (e) mapping selection

Most interesting, however, is the fact that the FREQ baseline, which uses the basic vector space model, consistently outperforms LSA.

4.3 Bilingual Concept Mapping

Using the same resources as the previous experiment, we ran an experiment to perform bilingual concept mapping by replacing the vectors to be compared with semantic vectors for concepts (see Section 3). For an English concept, i.e. a WordNet synset, we constructed its textual context set as the union of the set of words belonging to the synset, the set of words in its gloss, and the set of words in its example sentences. To represent our intuition that the synset words play a more important role in defining the semantic vector than the words in the gloss and examples, we applied weights of 60%, 30%, and 10% to the three components, respectively. Similarly, a semantic vector representing an Indonesian concept, i.e. a word sense in the KBBI, was constructed from a textual context set composed of the sublemma, the definition, and the example sentences of the word sense, using the same weightings. We only average word vectors if they appear in the collection (which depends on the experimental variables used).
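The weighted averaging just described can be sketched as follows. The renormalization over components that actually contribute at least one in-collection word is our assumption (the paper does not spell out how empty components are handled), and the toy word vectors are invented.

```python
def mean(vecs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def concept_vector(word_vecs, synset_words, gloss_words, example_words,
                   weights=(0.6, 0.3, 0.1)):
    """Semantic vector of a concept: a weighted average of the mean word
    vectors of its three context components (synset words, gloss words,
    example words), using the 60/30/10 weighting.

    Words absent from word_vecs (i.e. not in the training collection)
    are skipped, and the weights are renormalized over the components
    that contribute at least one word -- an assumption of this sketch.
    """
    parts = []
    for group, w in zip((synset_words, gloss_words, example_words), weights):
        vecs = [word_vecs[t] for t in group if t in word_vecs]
        if vecs:
            parts.append((w, mean(vecs)))
    total = sum(w for w, _ in parts)
    dim = len(parts[0][1])
    return [sum(w * v[i] for w, v in parts) / total for i in range(dim)]

# Toy vocabulary of 2-dimensional word vectors (invented values).
wv = {"chase": [1.0, 0.0], "pursuit": [0.0, 1.0], "run": [1.0, 1.0]}
v = concept_vector(wv, ["chase", "pursuit"], ["run"], ["absent"])
```

Here the example-sentence component contributes nothing (its only word is out of vocabulary), so the 60% and 30% weights are renormalized over the remaining two components.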

We formulated an experiment which closely resembles the word sense disambiguation problem: given a WordNet synset, the task is to select the most appropriate Indonesian sense from a subset of senses that have been selected based on their words appearing in our bilingual dictionary. These specific senses are called suggestions. Thus, instead of comparing the vector representing a synset such as communication with every single Indonesian sense in the KBBI, in this task we only compare it against suggestions with a limited range of sublemmas, e.g. komunikasi, perhubungan, hubungan, etc.

This setup is thus identical to that of an ongoing experiment here to manually map WordNet synsets to KBBI senses. Consequently, this facilitates assessment of the results by computing the level of agreement between the LSA-based mappings and the human annotations.

To illustrate, Tables 4(a) and 4(b) present a successful and an unsuccessful example of mapping a WordNet synset. For each example we show the synset ID and the ideal textual context set, i.e. the set of words that convey the synset, its gloss, and example sentences. We then show the actual textual context set with the notation {{X}, {Y}, {Z}}, where X, Y, and Z are the subsets of words that appear in the training collection. We then show the Indonesian word sense deemed to be most similar. For each sense we show the vector similarity score, the KBBI ID, and its ideal textual context set, i.e. the sublemma, its definition, and example sentences, followed by the actual textual context set in the same notation as above.

WordNet synset ID: 100319939, Words: chase, following, pursual, pursuit, Gloss: the act of pursuing in an effort to overtake or capture, Example: the culprit started to run and the cop took off in pursuit, Textual context set: {{following, chase}, {the, effort, of, to, or, capture, in, act, pursuing, an}, {the, off, took, to, run, in, culprit, started, and}}

KBBI ID: k39607 - Similarity: 0.804, Sublemma: mengejar, Definition: berlari untuk menyusul menangkap dsb memburu, Example: ia berusaha mengejar dan menangkap saya, Textual context set: {{mengejar}, {memburu, berlari, menangkap, untuk, menyusul}, {berusaha, dan, ia, mengejar, saya, menangkap}}

(a)

WordNet synset ID: 201277784, Words: crease, furrow, wrinkle, Gloss: make wrinkled or creased, Example: furrow one's brow, Textual context set: {{}, {or, make}, {s, one}}

KBBI ID: k02421 - Similarity: 0.69, Sublemma: alur, Definition: jalinan peristiwa dl karya sastra untuk mencapai efek tertentu pautannya dapat diwujudkan oleh hubungan temporal atau waktu dan oleh hubungan kausal atau sebab-akibat, Example: (none), Textual context set: {{alur}, {oleh, dan, atau, jalinan, peristiwa, diwujudkan, efek, dapat, karya, hubungan, waktu, mencapai, untuk, tertentu}, {}}

(b)

Table 4. Examples of (a) Successful and (b) Unsuccessful Concept Mappings

In the first example, the textual context sets from both the WordNet synset and the KBBI sense are fairly large, and provide sufficient context for LSA to choose the correct KBBI sense. In the second example, however, the textual context set for the synset is very small, since most of its words do not appear in the training collection; furthermore, it does not contain any of the words that truly convey the concept. As a result, LSA is unable to identify the correct KBBI sense.

For this experiment, we used the P1000 training collection. The results are presented in Table 5. As a baseline, we select three random suggested Indonesian word senses as a mapping for an English word sense. The reported random baseline in Table 5 is an average of 10 separate runs. Another baseline was computed by comparing English common-based concepts to their suggestions based on the full-rank word-document matrix. The top 3 Indonesian concepts with the highest similarity values are designated as the mapping results. Subsequently, we compute the Fleiss kappa (Fleiss, 1971) of these results together with the human judgements.
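The agreement statistic can be computed directly from a subjects-by-categories count matrix. The following is a minimal, stdlib-only Python sketch of Fleiss' kappa (Fleiss, 1971); the function name and matrix layout are ours, not taken from the paper's implementation:

```python
from typing import Sequence

def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a subjects-by-categories count matrix.

    ratings[i][j] = number of raters who assigned subject i to
    category j. Every subject must be rated by the same number
    of raters.
    """
    n_subjects = len(ratings)
    n_categories = len(ratings[0])
    n_raters = sum(ratings[0])
    total = n_subjects * n_raters

    # p_j: proportion of all assignments falling in category j
    p = [sum(row[j] for row in ratings) / total
         for j in range(n_categories)]

    # P_i: observed agreement among raters on subject i
    P = [(sum(c * c for c in row) - n_raters)
         / (n_raters * (n_raters - 1))
         for row in ratings]

    P_bar = sum(P) / n_subjects        # mean observed agreement
    P_e = sum(pj * pj for pj in p)     # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields a kappa of 1.0, and agreement at chance level yields 0; the kappa values in Table 5 fall between these extremes.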

The average level of agreement between the LSA 10% mappings and the human judges (0.2713) is not as high as that between the human judges themselves (0.4831). Nevertheless, it is generally better than the random baseline (0.2380) and the frequency baseline (0.2132), which suggests that LSA is indeed managing to capture some measure of bilingual semantic information implicit within the parallel corpus.

Furthermore, LSA mappings with 10% rank approximation yield higher levels of agreement than LSA with other rank approximations. This contradicts the word mapping results, where LSA with larger rank approximations yields higher scores (Section 4.2).

5 Discussion

Previous works have shown LSA to contribute positive gains to similar tasks such as Cross-Language Information Retrieval (Rehder et al., 1997). However, the bilingual word mapping results presented in Section 4.3 show the basic vector space model consistently outperforming LSA at that particular task, despite our initial intuition that LSA should actually improve precision and recall.

                          Fleiss Kappa Values
Judges   Synsets  Judges  Judges +  Judges +   Judges +      Judges +      Judges +
                  only    RNDM3     FREQ Top3  LSA 10% Top3  LSA 25% Top3  LSA 50% Top3
>= 2     144      0.4269  0.1318    0.1667     0.1544        0.1606        0.1620
>= 3     24       0.4651  0.2197    0.2282     0.2334        0.2239        0.2185
>= 4     8        0.5765  0.3103    0.2282     0.3615        0.3329        0.3329
>= 5     4        0.4639  0.2900    0.2297     0.3359        0.3359        0.3359
Average           0.4831  0.2380    0.2132     0.2713        0.2633        0.2623

Table 5. Results of Concept Mapping

We speculate that the task of bilingual word mapping may be even harder for LSA than that of bilingual concept mapping, due to its finer alignment granularity. While concept mapping attempts to map a concept conveyed by a group of semantically related words, word mapping attempts to map a word with a specific meaning to its translation in another language.

In theory, LSA employs rank reduction to remove noise and to reveal underlying information contained in a corpus. LSA has a 'smoothing' effect on the matrix, which is useful for discovering general patterns, e.g. clustering documents by semantic domain. Our experimental results, however, generally show the frequency baseline, which employs the full-rank word-document matrix, outperforming LSA.
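The rank reduction in question is the truncated SVD at the heart of LSA. A minimal NumPy sketch (the function name is ours) shows how the full-rank word-document matrix is replaced by its rank-k approximation:

```python
import numpy as np

def rank_k_approximation(X: np.ndarray, k: int) -> np.ndarray:
    """Rank-k LSA approximation of a word-document matrix X via
    truncated SVD: X ~= U_k @ diag(s_k) @ V_k^T."""
    # Singular values in s are returned in descending order, so
    # keeping the first k retains the dominant semantic dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

A "10% rank approximation" as used above would then correspond to choosing k equal to 10% of the full rank, while setting k to the full rank recovers the original matrix, i.e. the frequency baseline.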

We speculate that the rank reduction perhaps blurs some crucial details necessary for word mapping. The frequency baseline seems to encode more cooccurrence information than LSA: it compares English and Indonesian word vectors that contain the pure frequency of word occurrences in each document. LSA, on the other hand, encodes more semantic relatedness: it compares English and Indonesian word vectors containing estimates of word frequency in documents according to the contextual meaning. Since the purpose of bilingual word mapping is to obtain proper translations for an English word, it may be better framed as an issue of cooccurrence rather than semantic relatedness. That is, the higher the rate of cooccurrence between an English and an Indonesian word, the likelier they are to be translations of each other.
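Under either representation, the mapping step itself is the same: an English word's row vector is compared against every Indonesian word's row vector by cosine similarity, and the top-scoring candidates are proposed as translations. A sketch under assumed names (the function and its inputs are ours, hypothetical illustrations rather than the paper's code):

```python
import numpy as np

def top_translations(word_vec: np.ndarray,
                     cand_matrix: np.ndarray,
                     n: int = 3) -> list:
    """Rank candidate translations (rows of cand_matrix) by cosine
    similarity to word_vec and return the indices of the n best."""
    # Guard against all-zero vectors to avoid division by zero.
    norms = np.linalg.norm(cand_matrix, axis=1) * np.linalg.norm(word_vec)
    sims = (cand_matrix @ word_vec) / np.where(norms == 0.0, 1.0, norms)
    return list(np.argsort(-sims)[:n])
```

The frequency baseline feeds raw count vectors into such a comparison, whereas LSA feeds rows of the rank-reduced matrix instead; the experiments above differ only in that choice.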

LSA may yield better results in the case of finding words with similar semantic domains. Thus, the LSA mapping results should be better assessed using a resource listing semantically related terms, rather than a bilingual dictionary listing translation pairs. A bilingual dictionary imposes more specific constraints than semantic relatedness, as it specifies that the mapping results should be the translations of an English word.

Furthermore, polysemous terms may pose another problem for LSA. Through rank approximation, LSA estimates the occurrence frequency of a word in a particular document. Since the polysemy of English terms and Indonesian terms can be quite different, the estimates for words which are mutual translations can differ. For instance, kali and waktu are Indonesian translations of the English word time. However, kali is also the Indonesian translation of the English word river. Suppose kali and time appear frequently in documents about multiplication, but kali appears only rarely in documents about rivers, while waktu and time appear frequently in documents about time. LSA may then estimate kali with greater frequency in documents about multiplication and time, but with lower frequency in documents about rivers, so that the word vectors of kali and river are not similar. Thus, in bilingual word mapping, LSA may not suggest kali as a proper translation of river. Although polysemous words can also be a problem for the frequency baseline, since it merely uses raw word frequency vectors, the problem does not affect other word vectors. LSA, on the other hand, exacerbates the problem by taking it into account when estimating other word frequencies.

6 Summary

We have presented a model for computing bilingual word and concept mappings between two semantic lexicons, in our case Princeton WordNet and the KBBI, using an extension to LSA that exploits implicit semantic information contained within a parallel corpus.

The results, whilst far from conclusive, indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied, whereas for bilingual concept mapping LSA seems to slightly improve results. We speculate that this is because LSA, whilst very useful for revealing implicit semantic patterns, blurs the cooccurrence information that is necessary for establishing word translations.

We suggest that, particularly for bilingual word mapping, a finer granularity of alignment, e.g. at the sentential level, may increase accuracy (Deng and Gao, 2007).

Acknowledgment

The work presented in this paper is supported by an RUUI (Riset Unggulan Universitas Indonesia) 2007 research grant from DRPM UI (Direktorat Riset dan Pengabdian Masyarakat Universitas Indonesia). We would also like to thank Franky for help in software implementation and Desmond Darma Putra for help in computing the Fleiss kappa values in Section 4.3.


References

Eneko Agirre and Philip Edmonds, editors. 2007. Word Sense Disambiguation: Algorithms and Applications. Springer.

Katri A. Clodfelder. 2003. An LSA Implementation Against Parallel Texts in French and English. In Proceedings of the HLT-NAACL Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pages 111-114.

Yonggang Deng and Yuqing Gao. 2007. Guiding statistical word alignment models with prior knowledge. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 1-8, Prague, Czech Republic. Association for Computational Linguistics.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

Dan Kalman. 1996. A singularly valuable decomposition: The SVD of a matrix. The College Mathematics Journal, 27(1):2-23.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25:259-284.

Bob Rehder, Michael L. Littman, Susan T. Dumais, and Thomas K. Landauer. 1997. Automatic 3-language cross-language information retrieval with latent semantic indexing. In Proceedings of the Sixth Text Retrieval Conference (TREC-6), pages 233-239.

Syandra Sari. 2007. Perolehan informasi lintas bahasa indonesia-inggris berdasarkan korpus paralel dengan menggunakan metoda mutual information dan metoda similarity thesaurus. Master's thesis, Faculty of Computer Science, University of Indonesia. Call number: T-0617.
