Comparing the value of Latent Semantic Analysis on two English-to ...
Comparing the value of Latent Semantic Analysis on two English-to ...
Comparing the value of Latent Semantic Analysis on two English-to ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
ilingual c<strong>on</strong>cept mapping due <strong>to</strong> its finer alignment<br />
granularity. While c<strong>on</strong>cept mapping attempts<br />
<strong>to</strong> map a c<strong>on</strong>cept c<strong>on</strong>veyed by a group <str<strong>on</strong>g>of</str<strong>on</strong>g> semantically<br />
related words, word mapping attempts <strong>to</strong> map<br />
a word with a specific meaning <strong>to</strong> its translati<strong>on</strong> in<br />
ano<str<strong>on</strong>g>the</str<strong>on</strong>g>r language.<br />
In <str<strong>on</strong>g>the</str<strong>on</strong>g>ory, LSA employs rank reducti<strong>on</strong> <strong>to</strong> remove<br />
noise and <strong>to</strong> reveal underlying informati<strong>on</strong><br />
c<strong>on</strong>tained in a corpus. LSA has a ‘smoothing’ effect<br />
<strong>on</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g> matrix, which is useful <strong>to</strong> discover general<br />
patterns, e.g. clustering documents by semantic<br />
domain. Our experiment results, however, generally<br />
shows <str<strong>on</strong>g>the</str<strong>on</strong>g> frequency baseline, which employs<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> full rank word-document matrix, outperforming<br />
LSA.<br />
We speculate that <str<strong>on</strong>g>the</str<strong>on</strong>g> rank reducti<strong>on</strong> perhaps<br />
blurs some crucial details necessary for word mapping.<br />
The frequency baseline seems <strong>to</strong> encode<br />
more cooccurrence than LSA. It compares word<br />
vec<strong>to</strong>rs between <strong>English</strong> and Ind<strong>on</strong>esian that c<strong>on</strong>tain<br />
pure frequency <str<strong>on</strong>g>of</str<strong>on</strong>g> word occurrence in each<br />
document. On <str<strong>on</strong>g>the</str<strong>on</strong>g> o<str<strong>on</strong>g>the</str<strong>on</strong>g>r hand, LSA encodes more<br />
semantic relatedness. It compares <strong>English</strong> and Ind<strong>on</strong>esian<br />
word vec<strong>to</strong>rs c<strong>on</strong>taining estimates <str<strong>on</strong>g>of</str<strong>on</strong>g><br />
word frequency in documents according <strong>to</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
c<strong>on</strong>text meaning. Since <str<strong>on</strong>g>the</str<strong>on</strong>g> purpose <str<strong>on</strong>g>of</str<strong>on</strong>g> bilingual<br />
word mapping is <strong>to</strong> obtain proper translati<strong>on</strong>s for<br />
an <strong>English</strong> word, it may be better explained as an<br />
issue <str<strong>on</strong>g>of</str<strong>on</strong>g> cooccurrence ra<str<strong>on</strong>g>the</str<strong>on</strong>g>r than semantic relatedness.<br />
That is, <str<strong>on</strong>g>the</str<strong>on</strong>g> higher <str<strong>on</strong>g>the</str<strong>on</strong>g> rate <str<strong>on</strong>g>of</str<strong>on</strong>g> cooccurrence<br />
between an <strong>English</strong> and an Ind<strong>on</strong>esian word, <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
likelier <str<strong>on</strong>g>the</str<strong>on</strong>g>y are <strong>to</strong> be translati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> each o<str<strong>on</strong>g>the</str<strong>on</strong>g>r.<br />
LSA may yield better results in <str<strong>on</strong>g>the</str<strong>on</strong>g> case <str<strong>on</strong>g>of</str<strong>on</strong>g> finding<br />
words with similar semantic domains. Thus,<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> LSA mapping results should be better assessed<br />
using a resource listing semantically related terms,<br />
ra<str<strong>on</strong>g>the</str<strong>on</strong>g>r than using a bilingual dicti<strong>on</strong>ary listing<br />
translati<strong>on</strong> pairs. A bilingual dicti<strong>on</strong>ary demands<br />
more specific c<strong>on</strong>straints than semantic relatedness,<br />
as it specifies that <str<strong>on</strong>g>the</str<strong>on</strong>g> mapping results should<br />
be <str<strong>on</strong>g>the</str<strong>on</strong>g> translati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> an <strong>English</strong> word.<br />
Fur<str<strong>on</strong>g>the</str<strong>on</strong>g>rmore, polysemous terms may become<br />
ano<str<strong>on</strong>g>the</str<strong>on</strong>g>r problem for LSA. By rank approximati<strong>on</strong>,<br />
LSA estimates <str<strong>on</strong>g>the</str<strong>on</strong>g> occurrence frequency <str<strong>on</strong>g>of</str<strong>on</strong>g> a word<br />
in a particular document. Since polysemy <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>English</strong><br />
terms and Ind<strong>on</strong>esian terms can be quite different,<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> estimati<strong>on</strong>s for words which are mutual<br />
translati<strong>on</strong>s can be different. For instance, kali and<br />
waktu are Ind<strong>on</strong>esian translati<strong>on</strong>s for <str<strong>on</strong>g>the</str<strong>on</strong>g> <strong>English</strong><br />
word time. However, kali is also <str<strong>on</strong>g>the</str<strong>on</strong>g> Ind<strong>on</strong>esian<br />
translati<strong>on</strong> for <str<strong>on</strong>g>the</str<strong>on</strong>g> <strong>English</strong> word river. Suppose<br />
kali and time appear frequently in documents<br />
about multiplicati<strong>on</strong>, but kali and river appear<br />
rarely in documents about river. Then, waktu and<br />
time appear frequently in documents about time.<br />
As a result, LSA may estimate kali with greater<br />
frequency in documents about multiplicati<strong>on</strong> and<br />
time, but with lower frequency in documents about<br />
river. The word vec<strong>to</strong>rs between kali and river<br />
may not be similar. Thus, in bilingual word mapping,<br />
LSA may not suggest kali as <str<strong>on</strong>g>the</str<strong>on</strong>g> proper<br />
translati<strong>on</strong> for river. Although polysemous words<br />
can also be a problem for <str<strong>on</strong>g>the</str<strong>on</strong>g> frequency baseline,<br />
it merely uses raw word frequency vec<strong>to</strong>rs, <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
problem does not affect o<str<strong>on</strong>g>the</str<strong>on</strong>g>r word vec<strong>to</strong>rs. LSA,<br />
<strong>on</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g> o<str<strong>on</strong>g>the</str<strong>on</strong>g>r hand, exacerbates this problem by taking<br />
it in<strong>to</strong> account in estimating o<str<strong>on</strong>g>the</str<strong>on</strong>g>r word frequencies.<br />
6 Summary<br />
We have presented a model <str<strong>on</strong>g>of</str<strong>on</strong>g> computing bilingual<br />
word and c<strong>on</strong>cept mappings between <strong>two</strong> semantic<br />
lexic<strong>on</strong>s, in our case Princet<strong>on</strong> WordNet and <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
KBBI, using an extensi<strong>on</strong> <strong>to</strong> LSA that exploits implicit<br />
semantic informati<strong>on</strong> c<strong>on</strong>tained within a parallel<br />
corpus.<br />
The results, whilst far from c<strong>on</strong>clusive, indicate<br />
that that for bilingual word mapping, LSA is c<strong>on</strong>sistently<br />
outperformed by <str<strong>on</strong>g>the</str<strong>on</strong>g> basic vec<strong>to</strong>r space<br />
model, i.e. where <str<strong>on</strong>g>the</str<strong>on</strong>g> full-rank word-document matrix<br />
is applied, whereas for bilingual c<strong>on</strong>cept mapping<br />
LSA seems <strong>to</strong> slightly improve results. We<br />
speculate that this is due <strong>to</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g> fact that LSA,<br />
whilst very useful for revealing implicit semantic<br />
patterns, blurs <str<strong>on</strong>g>the</str<strong>on</strong>g> cooccurrence informati<strong>on</strong> that is<br />
necessary for establishing word translati<strong>on</strong>s.<br />
We suggest that, particularly for bilingual word<br />
mapping, a finer granularity <str<strong>on</strong>g>of</str<strong>on</strong>g> alignment, e.g. at<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> sentential level, may increase accuracy (Deng<br />
and Gao, 2007).<br />
Acknowledgment<br />
The work presented in this paper is supported by<br />
an RUUI (Riset Unggulan Universitas Ind<strong>on</strong>esia)<br />
2007 research grant from DRPM UI (Direk<strong>to</strong>rat<br />
Riset dan Pengabdian Masyarakat Universitas Ind<strong>on</strong>esia).<br />
We would also like <strong>to</strong> thank Franky for<br />
help in s<str<strong>on</strong>g>of</str<strong>on</strong>g>tware implementati<strong>on</strong> and Desm<strong>on</strong>d<br />
Darma Putra for help in computing <str<strong>on</strong>g>the</str<strong>on</strong>g> Fleiss kappa<br />
<str<strong>on</strong>g>value</str<strong>on</strong>g>s in Secti<strong>on</strong> 4.3.