18.07.2013 Views

DanNet<br />

From Dictionary<br />

to Wordnet<br />

Jörg Asmussen<br />

Society for Danish Language and Literature, DSL, Copenhagen<br />

Bolette Sandford Pedersen<br />

Centre for Language Technology, CST, University of Copenhagen<br />

Lars Trap-Jensen<br />

Society for Danish Language and Literature, DSL, Copenhagen


Outline<br />

1. Introduction (LTJ, 2 min.)<br />

2. Characteristics of the DDO (LTJ, 5 min.)<br />

3. Building DanNet (BSP, 8 min.)<br />

4. Extraction of differentia info (JA, 7 min.)<br />

5. Conclusions (JA, 2 min.)


DanNet<br />

• Lexical-semantic <strong>wordnet</strong> for Danish<br />

• Joint project:<br />

• Society for Danish Language and Literature<br />

• Centre for Language Technology, University of Copenhagen<br />

• 4 years (2005 – 2008), ~ 400,000 €


Limited resources<br />

• Adapt an existing <strong>wordnet</strong>?<br />

or<br />

• Reuse other lexical-semantic resources:<br />

• SIMPLE-DK<br />

• Den Danske Ordbog, DDO


Outline<br />

1. Introduction<br />

2. Characteristics of the DDO<br />

3. Building DanNet<br />

4. Extraction of differentia info from definitions<br />

5. Conclusions


Den Danske Ordbog<br />

• Published by DSL 2003–5<br />

• Corpus-based, DDOC<br />

• 60,000 entries<br />

• Spelling, morphology, pronunciation, meaning, collocations, fixed phrases, syntax, usage, word formation, etymology


Den Danske Ordbog<br />

• Words edited in related groups<br />

• Machine readable<br />

• Fine-grained microstructure<br />

• 100,000 definitions


Semantic description<br />

Systematic domain info → concerns relation<br />

Sense definition → relevant info „manually“ extracted<br />

Hyperonym<br />

Sense relations, i.e. synonyms<br />

Collocational information<br />

Authentic example


Definitions in the DDO<br />

Definition scheme:<br />

• Genus proximum – closest hyperonym:<br />

apparat ‚technical device‘<br />

• Differentia specifica – distinctive feature:<br />

remaining part of the definition
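
The scheme above can be pictured as a small data structure. The following is a minimal sketch, not the DDO or DanNet data model: `Definition` and `split_definition` are hypothetical names, the genus is assumed to be supplied by the dictionary's genus specification, and the text is a shortened English gloss of the fjernsyn definition discussed later in the talk.

```python
from dataclasses import dataclass


@dataclass
class Definition:
    """A definition split according to the genus/differentia scheme."""
    headword: str
    genus: str        # genus proximum: the closest hyperonym
    differentia: str  # differentia specifica: the remaining part


def split_definition(headword: str, text: str, genus: str) -> Definition:
    """Take the first occurrence of the genus word out of the definition.

    Everything that is left (modifiers before the genus and the
    material after it) is kept as the differentia.
    """
    words = text.split()
    if genus not in words:
        raise ValueError(f"genus {genus!r} not found in definition")
    i = words.index(genus)
    rest = words[:i] + words[i + 1:]
    return Definition(headword, genus, " ".join(rest))


# Shortened English gloss, used for illustration only.
d = split_definition(
    "fjernsyn",
    "box-shaped device that can receive tv signals",
    "device",
)
print(d)
```

The sketch only separates the two parts; it says nothing about what the differentia means, which is the hard part addressed in the rest of the talk.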


Building DanNet<br />

• Extract definitions and genus specifications<br />

• Include them in the DanNet tool<br />

• Use it for domain-wise development of data:<br />

1. Homonymy and polysemy<br />

2. Establishing synsets<br />

3. Adjusting the hierarchical structure


Homonymy & polysemy<br />

celle ‚cell‘ is genus proximum of<br />

• gærcelle ‚yeast cell‘<br />

• fængselscelle ‚prison cell‘<br />

Convert lexical expressions into concepts:<br />

• celle-1 ‚part of living organism‘<br />

• celle-2 ‚small room‘


Establishing synsets<br />

informatik ‚informatics‘<br />

lære ‚studies‘<br />

bromatologi ‚nutrition science‘<br />

fag ‚subject‘<br />

samfundsfag ‚social studies‘<br />

videnskab ‚science‘<br />

datalogi ‚computer science‘<br />

One synset


Building the hierarchy<br />

Hyponymy is generally defined as<br />

• X is a Y<br />

Taxonymy is a subtype of this:<br />

• X is a kind/type of Y<br />

Cf. Cruse, 1991 and 2002


Example: Hyponymy?<br />

kirsebærtræ ‚cherry tree‘<br />

træ ‚tree‘<br />

birketræ ‚birch‘<br />

vejtræ ‚roadside tree‘<br />

„Orthogonal“ hyponymy


Building the hierarchy<br />

TOP<br />

genstand ‚object‘<br />

møbel ‚furniture‘<br />

siddemøbel ‚sitting furniture‘<br />

stol ‚chair‘<br />

indbo/bohave ‚household effects‘
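
The chain from stol up to TOP can be sketched as a simple parent lookup. This is an illustrative sketch only: `HYPERONYM` and `hyperonym_chain` are hypothetical names, and indbo/bohave is left out because the slide text does not say how it is linked to the other nodes.

```python
# Each concept points to its hyperonym, as listed on the slide.
# TOP has no entry because it is the root.
HYPERONYM = {
    "genstand": "TOP",
    "møbel": "genstand",
    "siddemøbel": "møbel",
    "stol": "siddemøbel",
}


def hyperonym_chain(concept, hyperonym_of=HYPERONYM):
    """Return the concept followed by all of its hyperonyms up to the root."""
    chain = [concept]
    while chain[-1] in hyperonym_of:
        parent = hyperonym_of[chain[-1]]
        if parent in chain:
            raise ValueError(f"cycle in hierarchy at {parent!r}")
        chain.append(parent)
    return chain


chain = hyperonym_chain("stol")
print(" < ".join(chain))
```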


Definition composition<br />

• Genus selection – a conscious process<br />

• Differentia:<br />

• No editorial specifications, i.e. no fixed definition vocabulary or syntax<br />

• Consequences for DanNet:<br />

• Complicates computational exploitation<br />

• Semantic relations are coded manually


Coding relations<br />

• What is done manually:<br />

• No semantic info other than that of DDO<br />

• Reduction of semantic info<br />

• What is done automatically:<br />

• Inheritance of relations from hyperonyms
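
Inheritance of relations from hyperonyms could work roughly as follows. This is a sketch under stated assumptions, not the DanNet tool: the names `HYPERONYM`, `CODED` and `inherited_relations` are hypothetical, the value `sidde` is invented for illustration, and letting a relation coded on a closer node override an inherited one is a design assumption that the slides do not confirm.

```python
# Hierarchy taken from the "Building the hierarchy" slide.
HYPERONYM = {
    "stol": "siddemøbel",
    "siddemøbel": "møbel",
    "møbel": "genstand",
}

# Manually coded relations; the value is a made-up illustration.
CODED = {
    "siddemøbel": {"for_purpose_of": "sidde"},
}


def inherited_relations(concept):
    """Merge the relations coded on a concept and on all its hyperonyms."""
    chain = []
    node = concept
    while node is not None:
        chain.append(node)
        node = HYPERONYM.get(node)
    merged = {}
    # Walk from the most general node down, so closer nodes win.
    for node in reversed(chain):
        merged.update(CODED.get(node, {}))
    return merged


print(inherited_relations("stol"))
```

With this arrangement the editor codes a relation once, at the most general concept where it holds, and every hyponym below it receives the relation automatically.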


Extraction of telic role<br />

fjernsyn ‚tv set‘<br />

genus expression<br />

‚box-shaped device that can receive tv signals and transform them into animated pictures on a screen and accompanying sound in the speakers of the device‘<br />

Telic role:<br />

VPs headed by ‚can‘


Hypothesis<br />

‣ VPs in a relative clause which are headed by kan ‚can‘ specify the telic role (i.e. the for_purpose_of relation) of the definiendum<br />

Corpus query<br />

‣ Find all definitions with genus apparat followed by der or som followed by kan followed by a word ending in e
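
The corpus query can be approximated with a regular expression. This is a rough sketch, not the query language actually used: it matches the word apparat in the running text instead of consulting the dictionary's genus field, and the two Danish definition strings are invented for illustration.

```python
import re

# apparat, then der or som, then kan, then a word ending in e
# (the candidate infinitive that heads the VP).
TELIC_QUERY = re.compile(r"\bapparat\s+(?:der|som)\s+kan\s+(\w+e)\b")


def telic_head(definition):
    """Return the candidate telic verb, or None if the pattern is absent."""
    match = TELIC_QUERY.search(definition)
    return match.group(1) if match else None


# Invented definition texts, for illustration only.
direct = "kasseformet apparat der kan modtage tv-signaler"
interposed = "apparat som ved hjælp af en laser kan afspille cd'er"

print(telic_head(direct))
print(telic_head(interposed))
```

The second string contains material between som and kan, so the strict pattern does not match it even though the definition expresses a purpose.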


Results of corpus query<br />

[Figure: query results, with the labels: query, VP heads denoting telic role, dictionary entries]<br />

Only 26 occurrences of this pattern – but 203 apparat definitions


Why such poor coverage?<br />

1. Definitions where the pattern contains interposed material are not captured<br />

2. Structural patterns other than the one given in our hypothesis also indicate a for_purpose_of relation


Further patterns<br />

1. GE that can VP-inf<br />

2. GE that is used for to VP-inf with<br />

3. GE for to VP-inf with/on/in<br />

4. GE that VP-fin<br />

5. GE for NP<br />

6. GE that is specially designed for to VP-inf<br />

head → for_purpose_of<br />

These patterns capture 70% of the apparat definitions
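
The two coverage figures can be put side by side with a little arithmetic. The sketch assumes that the 70% refers to the same 203 apparat definitions as the first query; the variable names are illustrative.

```python
total_apparat_definitions = 203  # apparat definitions in the DDO
hits_first_pattern = 26          # occurrences of the original pattern
share_all_patterns = 0.70        # share captured by the six patterns

# Coverage of the original pattern alone.
coverage_first = hits_first_pattern / total_apparat_definitions

# Approximate number of definitions captured by all six patterns.
captured_all = round(share_all_patterns * total_apparat_definitions)

print(f"first pattern alone: {coverage_first:.1%}")
print(f"all six patterns: about {captured_all} of {total_apparat_definitions}")
```

Under that assumption the original pattern covers roughly one definition in eight, while the extended set covers about seven in ten.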


A statistical approach<br />

• Frequency list of types in definitions with genus apparat<br />

compared with<br />

• frequency list of types in all definitions<br />

using a statistical test (e.g. log likelihood)<br />

‣ Salient types are listed for investigation and may give hints on semantic relations
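
The comparison of the two frequency lists can be sketched with the log-likelihood statistic commonly used for keyword extraction. The sketch makes several assumptions: the counts are toy numbers invented for illustration, the reference list is treated as a separate corpus, only types that are relatively more frequent in the apparat definitions are kept, and `log_likelihood` and `salient_types` are hypothetical names.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a type seen a times in c target tokens and b times in d reference tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target list
    e2 = d * (a + b) / (c + d)  # expected count in the reference list
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def salient_types(target, reference):
    """Types overused in the target list, most salient first."""
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep overused types only
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)]


# Toy counts, invented for illustration.
apparat_defs = Counter({"måle": 30, "afspille": 12, "der": 120, "og": 150})
other_defs = Counter({"måle": 60, "afspille": 15, "der": 12000, "og": 15500})

print(salient_types(apparat_defs, other_defs))
```

Function words such as der and og are frequent everywhere and drop out, while content words that cluster in the apparat definitions rise to the top of the list for the editor to inspect.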


Some salient types<br />

• afspille ‚to play back‘: grammofon, cd-afspiller, afspiller, sequencer, diktafon<br />

• afspilning ‚play back‘: kassettespiller, hjemmevideo, kassettebåndoptager, båndoptager<br />

• måle ‚measure‘: stroboskop, måler, timer, løgnedetektor, ekkolod<br />

• måler ‚measuring tool‘: gasmåler, speedometer, omdrejningstæller, benzinmåler, fotofælde<br />

• måling ‚gauging‘: elmåler, trykmåler, luxmeter, spirometer, gyrometer, alkometer, newtonmeter, magnetometer, instrument, måleinstrument, kalorimeter<br />

• målinger ‚measurements‘: radiosonde, satellit, fartskriver


Automatic extraction?<br />

Basically NO...<br />

Developing reliable methods is too expensive!<br />

• Structural and lexical properties of definitions differ considerably<br />

‣ Difficult to automatically extract semantic relations from definitions<br />

‣ Concordances and lists of salient definition types may help the editor<br />

‣ But the DanNet editor still has to do the core job of analysing dictionary definitions


Conclusion<br />

Reusing the DDO<br />

[Scale: cheap to expensive]<br />

Semi-automatic exploitation of the dictionary structure<br />

• hyponymy structure<br />

• synonym/antonym info<br />

Automatic exploitation of definitions proper to find other semantic relations


Conclusion<br />

The DanNet approach<br />

[Scale: cheap to expensive]<br />

Translation/expansion of existing WNs?<br />

• Better coherence with other WNs<br />

• Linguistic bias<br />

Reusing/merging language resources?<br />

• More loyal to the specific language<br />

• Expensive, unless based on an existing resource, i.e. a dictionary
