Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
6.4. A large thesaurus for Portuguese
Properties of the synonymy networks
In a similar fashion to what was done for the complete network (table 5.1), table 6.6
contains the total number of nodes (|V|) and edges (|E|), and the average network
degree (deg(N), computed according to expression 5.5). It also contains the
number of sub-networks (Sub-nets), which are groups of nodes connected directly
or indirectly in N; the number of nodes of the largest and second largest sub-networks
(|Vlcs| and |Vlcs2|); and the average clustering coefficient of the largest
sub-network (CClcs, computed according to expression 5.7).
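The measures listed above can be sketched in a few lines of Python. This is only an illustration: expressions 5.5 and 5.7 are not reproduced in this section, so the sketch assumes the usual definitions (average degree as 2|E|/|V|, and the mean of the local clustering coefficients of the nodes in the largest sub-network). The function names and the toy edge list are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build undirected adjacency sets from (word, word) synonymy pairs."""
    adj = {}
    for a, b in edges:
        if a == b:  # ignore self-loops
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def components(adj):
    """Return the sub-networks (connected components), largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


def avg_clustering(adj, nodes):
    """Mean local clustering coefficient over `nodes` (0 when degree < 2)."""
    total = 0.0
    for v in nodes:
        nbrs = adj[v]
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes)


def network_properties(edges):
    """Compute the columns of a table like 6.6 for one (non-empty) network."""
    adj = build_graph(edges)
    n_nodes = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    comps = components(adj)
    return {
        "V": n_nodes,
        "E": n_edges,
        "deg": 2 * n_edges / n_nodes,  # assumed form of expression 5.5
        "sub_nets": len(comps),
        "V_lcs": len(comps[0]),
        "V_lcs2": len(comps[1]) if len(comps) > 1 else 0,
        "CC_lcs": avg_clustering(adj, comps[0]),  # assumed form of expression 5.7
    }


# Toy network: a triangle with a pendant node, plus an isolated pair.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("x", "y")]
props = network_properties(toy_edges)
```

For the toy network, `props` reports 6 nodes, 5 edges, 2 sub-networks, and a largest sub-network of 4 nodes.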
From table 6.6, we notice that these synonymy networks are significantly different
from the original. First, they are smaller, as they only contain about half of the
nouns, one sixth of the verbs and one third of the adjectives. Second, they have
substantially lower degrees, and clustering coefficients close to 0, which means they
are less connected and do not tend to form clusters. Nevertheless, they still have
one large core sub-network and several smaller ones.
This confirms that a simpler clustering algorithm is suitable for our purpose,
especially because ambiguity is much lower and several clusters are already defined
by small complete sub-networks. The noun network contains 4,470 sub-networks of
size 2 and 1,127 of size 3. For verbs, these numbers are 437 and 97, respectively,
and for adjectives 1,303 and 262.
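To make the observation about small complete sub-networks concrete, the following sketch counts sub-networks by size and picks out those that are complete, that is, where every word is linked to every other word, so that the sub-network can be read directly as one cluster. It is not the algorithm of section 6.3; the function names, the `max_size` parameter and the toy edges are assumptions made for illustration.

```python
from collections import Counter


def build_graph(edges):
    """Build undirected adjacency sets from (word, word) synonymy pairs."""
    adj = {}
    for a, b in edges:
        if a == b:  # ignore self-loops
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def sub_networks(adj):
    """Return the connected components of the network as sets of words."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        result.append(comp)
    return result


def is_complete(adj, comp):
    """True if every node in `comp` is linked to all other nodes in `comp`."""
    return all(len(adj[v] & comp) == len(comp) - 1 for v in comp)


def ready_made_clusters(edges, max_size=3):
    """Return (size histogram, small complete sub-networks usable as clusters)."""
    adj = build_graph(edges)
    comps = sub_networks(adj)
    sizes = Counter(len(c) for c in comps)
    clusters = [c for c in comps if len(c) <= max_size and is_complete(adj, c)]
    return sizes, clusters


# Toy data: one pair, one triangle, and one 3-node chain (not complete).
toy_edges = [("a", "b"), ("c", "d"), ("d", "e"), ("c", "e"), ("f", "g"), ("g", "h")]
sizes, clusters = ready_made_clusters(toy_edges)
```

A sub-network of size 2 is always complete, so every such pair is already a cluster; for size 3, only triangles qualify, while chains such as f-g-h still need to be clustered.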
POS        |V|     |E|     deg(N)  Sub-nets  |Vlcs|  CClcs  |Vlcs2|
Noun       21,272  15,294  1.44    6,556     2,816   0.03   66
Verb        1,807   1,197  1.32      614       153   0.00   29
Adjective   4,695   3,050  1.30    1,743       169   0.02   50
Table 6.6: Properties of the synonymy networks remaining after assignment.
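As a quick sanity check on table 6.6, the sketch below recomputes the average degree from the reported |V| and |E|. Expression 5.5 is not reproduced in this section, so the formula 2|E|/|V| is an assumption; under it, the recomputed values match the reported ones to two decimal places. The names `TABLE_6_6`, `average_degree` and `check_table` are hypothetical.

```python
# (|V|, |E|, reported deg(N)) copied from table 6.6.
TABLE_6_6 = {
    "Noun": (21272, 15294, 1.44),
    "Verb": (1807, 1197, 1.32),
    "Adjective": (4695, 3050, 1.30),
}


def average_degree(n_nodes, n_edges):
    """Assumed form of expression 5.5: each edge contributes to two node degrees."""
    return 2 * n_edges / n_nodes


def check_table(table, tol=0.005):
    """Map each POS to True if the recomputed degree is within `tol` of the reported one."""
    return {
        pos: abs(average_degree(v, e) - reported) <= tol
        for pos, (v, e, reported) in table.items()
    }


consistent = check_table(TABLE_6_6)
```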
Clustering examples
Figures 6.2, 6.3 and 6.4 illustrate the result of clustering three sub-networks. The
first sub-network results in only one cluster, with several synonyms for someone
who speaks Greek. The second and the third are divided into different clusters,
represented by different shades of grey.
In figure 6.3, the sub-network is divided into two different meanings of the verb
'splash', one of them more abstract (esparrinhar) and the other done with the
feet or hands (bachicar), although three words may be used with both meanings. The
meanings covered by the four clusters in figure 6.4 are, respectively: a person who
gives moral qualities; a person who evangelises; a person who spreads ideas; and a
person who is an active member of a cause.
Evaluation of the clustering results
In order to check whether the algorithm described in section 6.3 is effective, and to get an
idea of the quality of the discovered clusters, we evaluated them manually.
Once again, we had two judges classifying pairs of words, collected from the same
synset, as synonymous or not. This kind of evaluation is easier and slightly less
subjective than the evaluation of complete synsets. Furthermore, in section 5.3.5
we reported similar results using both kinds of evaluation.
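The pair-based protocol can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the pairs were sampled or how the two judgements were combined, so the random sampling, the fixed seed and the raw agreement figure are all hypothetical choices, as are the function names and the placeholder data.

```python
import random
from itertools import combinations


def candidate_pairs(synsets):
    """All unordered word pairs that share a synset."""
    pairs = []
    for synset in synsets:
        pairs.extend(combinations(sorted(synset), 2))
    return pairs


def sample_pairs(synsets, k, seed=0):
    """Draw up to k pairs for the judges (uniform sampling is an assumption)."""
    pairs = candidate_pairs(synsets)
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def summarise(judge_a, judge_b):
    """Share of pairs each judge marked as synonymous, plus raw agreement."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty set of pairs")
    n = len(judge_a)
    return {
        "correct_a": sum(judge_a) / n,
        "correct_b": sum(judge_b) / n,
        "agreement": sum(x == y for x, y in zip(judge_a, judge_b)) / n,
    }


# Placeholder synsets and labels, not data from the evaluation.
toy_synsets = [{"w1", "w2", "w3"}, {"w4", "w5"}, {"w6"}]
pairs = candidate_pairs(toy_synsets)
summary = summarise([True, True, False, True], [True, False, False, True])
```

Judging pairs instead of whole synsets keeps each decision binary, which fits the remark above that this kind of evaluation is easier and slightly less subjective.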