26.04.2015 Views

Comparing the value of Latent Semantic Analysis on two English-to ...


bilingual concept mapping due to its finer alignment granularity. While concept mapping attempts to map a concept conveyed by a group of semantically related words, word mapping attempts to map a word with a specific meaning to its translation in another language.

In theory, LSA employs rank reduction to remove noise and to reveal underlying information contained in a corpus. LSA has a ‘smoothing’ effect on the matrix, which is useful for discovering general patterns, e.g. clustering documents by semantic domain. Our experimental results, however, generally show the frequency baseline, which employs the full-rank word-document matrix, outperforming LSA.
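The rank reduction described above can be illustrated with a truncated singular value decomposition. The following is a minimal sketch using NumPy, not the implementation used in the experiments: the function name `rank_reduce`, the toy count matrix and the choice of k = 2 are all assumptions made for illustration.

```python
import numpy as np

def rank_reduce(matrix: np.ndarray, k: int) -> np.ndarray:
    """Return the rank-k approximation of `matrix` obtained by truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

# Hypothetical toy word-document matrix: rows are words, columns are documents.
counts = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
])

# The frequency baseline would use `counts` as it is (full rank);
# the LSA-style variant would use the smoothed, rank-reduced matrix.
smoothed = rank_reduce(counts, k=2)
print(np.round(smoothed, 2))
```

The smoothed matrix has the same shape as the original, but every entry is now an estimate influenced by the whole corpus rather than a raw count.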

We speculate that the rank reduction perhaps blurs some crucial details necessary for word mapping. The frequency baseline seems to encode more cooccurrence than LSA. It compares English and Indonesian word vectors that contain the raw frequency of each word in each document. On the other hand, LSA encodes more semantic relatedness. It compares English and Indonesian word vectors containing estimates of word frequency in documents according to the contextual meaning. Since the purpose of bilingual word mapping is to obtain proper translations for an English word, it may be better explained as an issue of cooccurrence rather than semantic relatedness. That is, the higher the rate of cooccurrence between an English and an Indonesian word, the likelier they are to be translations of each other.
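The cooccurrence argument above can be made concrete with a small sketch of the frequency baseline. It assumes that each vector position corresponds to one parallel document pair and that vectors are compared by cosine similarity; the similarity measure, the helper names `cosine` and `rank_candidates`, and all counts are assumptions for illustration only.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two frequency vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_candidates(en_vec: np.ndarray, id_vectors: dict) -> list:
    """Sort Indonesian words by similarity to an English word vector, best first."""
    scores = {word: cosine(en_vec, vec) for word, vec in id_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw counts over four parallel document pairs.
time_vec = np.array([3.0, 0.0, 2.0, 1.0])       # English word "time"
id_vectors = {
    "waktu": np.array([2.0, 0.0, 2.0, 0.0]),    # cooccurs with "time" in most pairs
    "kali": np.array([1.0, 3.0, 0.0, 1.0]),     # occurs mostly where "time" does not
}

ranked = rank_candidates(time_vec, id_vectors)
print(ranked)
```

In this toy setting the word whose occurrences overlap most with those of the English word is ranked first, which is the behaviour the cooccurrence explanation relies on.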

LSA may yield better results in the case of finding words with similar semantic domains. Thus, the LSA mapping results would be better assessed using a resource listing semantically related terms, rather than a bilingual dictionary listing translation pairs. A bilingual dictionary imposes a more specific constraint than semantic relatedness, as it requires that the mapping results be translations of an English word.

Furthermore, polysemous terms may become another problem for LSA. By rank approximation, LSA estimates the occurrence frequency of a word in a particular document. Since the polysemy of English terms and Indonesian terms can be quite different, the estimates for words which are mutual translations can be different. For instance, kali and waktu are Indonesian translations for the English word time. However, kali is also the Indonesian translation for the English word river. Suppose kali and time appear frequently in documents about multiplication, but kali and river appear rarely in documents about river. Suppose further that waktu and time appear frequently in documents about time. As a result, LSA may estimate kali with greater frequency in documents about multiplication and time, but with lower frequency in documents about river. The word vectors of kali and river may then not be similar. Thus, in bilingual word mapping, LSA may not suggest kali as the proper translation for river. Polysemous words can also be a problem for the frequency baseline, but because it merely uses raw word frequency vectors, the problem does not affect other word vectors. LSA, on the other hand, exacerbates this problem by taking it into account in estimating other word frequencies.
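The last point, that rank reduction lets one word's distribution affect the estimates for other words, can be checked on a toy example. The sketch below is an illustration under assumed counts, not a reproduction of the experiments: it builds two hypothetical corpora that differ only in the counts of kali, then compares the vector of river in the raw matrix and in a rank-2 approximation.

```python
import numpy as np

def rank_reduce(matrix: np.ndarray, k: int) -> np.ndarray:
    """Rank-k approximation via truncated SVD (the LSA-style estimate)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

# Columns: documents about multiplication, time, river (hypothetical counts).
time_row = [3.0, 3.0, 0.0]
river_row = [0.0, 0.0, 2.0]

# Corpus A: kali occurs only in the river document.
corpus_a = np.array([time_row, river_row, [0.0, 0.0, 1.0]])
# Corpus B: kali also occurs often in the multiplication document.
corpus_b = np.array([time_row, river_row, [4.0, 0.0, 1.0]])

RIVER = 1  # row index of "river" in both matrices

# Raw frequency vectors: the row for "river" does not depend on "kali".
raw_unchanged = np.array_equal(corpus_a[RIVER], corpus_b[RIVER])

# Rank-2 estimates: the row for "river" is recomputed from the whole matrix.
lsa_a = rank_reduce(corpus_a, k=2)[RIVER]
lsa_b = rank_reduce(corpus_b, k=2)[RIVER]
lsa_unchanged = np.allclose(lsa_a, lsa_b)

print("raw river vector unchanged:", raw_unchanged)
print("rank-2 river vector unchanged:", lsa_unchanged)
```

Only the counts of kali differ between the two corpora, yet the rank-2 estimate for river differs as well, whereas its raw frequency vector is identical in both.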

6 Summary

We have presented a model for computing bilingual word and concept mappings between two semantic lexicons, in our case Princeton WordNet and the KBBI, using an extension to LSA that exploits implicit semantic information contained within a parallel corpus.

The results, whilst far from conclusive, indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied, whereas for bilingual concept mapping LSA seems to slightly improve results. We speculate that this is due to the fact that LSA, whilst very useful for revealing implicit semantic patterns, blurs the cooccurrence information that is necessary for establishing word translations.

We suggest that, particularly for bilingual word mapping, a finer granularity of alignment, e.g. at the sentential level, may increase accuracy (Deng and Gao, 2007).

Acknowledgment<br />

The work presented in this paper is supported by an RUUI (Riset Unggulan Universitas Indonesia) 2007 research grant from DRPM UI (Direktorat Riset dan Pengabdian Masyarakat Universitas Indonesia). We would also like to thank Franky for help in software implementation and Desmond Darma Putra for help in computing the Fleiss kappa values in Section 4.3.
