
Automatic detection
of new domain-specific words,
using document classification
and frequency profiling

Jørg Asmussen
Department for Digital Dictionaries and Text Corpora
Society for Danish Language and Literature, DSL
ja@dsl.dk

Corpus Linguistics 2005


Contents

1 Background: updating the DDO
  1.1 Source of new vocabulary
  1.2 Updating task
  1.3 Prerequisites for the updating task
2 Data: the DDO Corpus
3 Analysis: deriving domain vocabularies
  3.1 Examples: most salient types for 3 domains
  3.2 Quantitative problems with the approach
  3.3 Qualitative problems with the approach
4 Text classification
  4.1 Starting point
  4.2 Problems
  4.3 To compute a score for a certain domain
5 Comparison: determining new words
  5.1 Procedure
  5.2 Example
  5.3 Classification
  5.4 Comparison
  5.5 New word candidates
  5.6 New sense candidates
6 Discussion, future, and conclusion
  6.1 Basic decisions – and questions
  6.2 Testing in progress
  6.3 Future work
  6.4 Conclusion




1. Background: updating the DDO

• Corpus-based dictionary of modern Danish
• Published 2003–2005 in six volumes
• Enhanced web-based version under preparation: ordnet.dk

Enhanced =⇒ continuously updated with new vocabulary



1.1. Source of new vocabulary

• Mainly newspaper material
  – drawn from the Danish media database, www.infomedia.dk

1.2. Updating task

• Prior to lexicographic description, new vocabulary is subdivided into domains
−→ Assign incoming texts to particular domains
−→ Extract from them words so far not listed in the dictionary
−→ Treat them as candidates for inclusion



1.3. Prerequisites for the updating task

1. Design of a suitable domain classification:
   (a) granularity – how many different domains?
   (b) contents – how to define a certain domain intensionally?
   −→ Decimal classification system DK5 (cf. Dewey's)
2. Classification procedure – how to assign a text to a domain?
   −→ Statistical, based on the DDOC



2. Data: the DDO Corpus

The Corpus of the Danish Dictionary, DDOC:
• 43000 text samples totalling 40 million words
• Compiled by DSL 1991–93
• Broad coverage of Danish language from 1983–1992
• 88.6% of the text samples are assigned to one of 66 domains

From this material, we will
• Extract 66 domain-specific vocabularies
• Use these vocabularies to classify unseen text



3. Analysis: deriving domain vocabularies

1. Creation of domain-specific subcorpora:
   DDOC domain codes =⇒ 66 domain-specific subcorpora
2. Creation of frequency profiles:
   DDOC + subcorpora =⇒ frequency profiles
3. Comparison of frequency profiles:
   • Each of the 66 frequency profiles is compared with the frequency profile of the total DDOC. Significance test: log likelihood.
   • Over-represented types (p ≥ 0.99) within a domain yield a domain-specific vocabulary D.
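Steps 2–3 can be sketched in code. This is a minimal illustration, not DSL's actual implementation: the function names and the dictionary-based frequency profiles are assumptions, and 6.63 is the chi-square critical value (1 degree of freedom) corresponding to the p ≥ 0.99 threshold.

```python
import math

def log_likelihood(f_sub, n_sub, f_ref, n_ref):
    """Dunning's log-likelihood statistic (G2) for one type,
    comparing a domain subcorpus against a reference corpus."""
    # expected frequencies under the null hypothesis of equal relative frequency
    total = n_sub + n_ref
    e_sub = n_sub * (f_sub + f_ref) / total
    e_ref = n_ref * (f_sub + f_ref) / total
    g2 = 0.0
    if f_sub > 0:
        g2 += f_sub * math.log(f_sub / e_sub)
    if f_ref > 0:
        g2 += f_ref * math.log(f_ref / e_ref)
    return 2 * g2

def domain_vocabulary(sub_profile, n_sub, ref_profile, n_ref, critical=6.63):
    """Collect the types over-represented in the subcorpus at p >= 0.99
    (G2 >= 6.63 for 1 degree of freedom): the domain vocabulary D."""
    D = set()
    for t, f in sub_profile.items():
        f_ref = ref_profile.get(t, 0)
        over = f / n_sub > f_ref / n_ref  # over-, not under-, represented
        if over and log_likelihood(f, n_sub, f_ref, n_ref) >= critical:
            D.add(t)
    return D
```

Applied to a toy profile, a domain term with a much higher relative frequency in the subcorpus enters D, while a function word with the same relative frequency in both profiles does not.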


3.1. Examples: most salient types for 3 domains

D_computing                D_philosophy                    D_economy
data                       mennesket 'man'                 kr (Danish currency unit, abbreviated)
programmer 'programs'      kierkegaard                     X,X (amount with decimals)
computer                   moral                           pct
computeren 'the computer'  løgstrup (Danish philosopher)   procent 'percent'
edb 'computing'            aristoteles                     kroner (Danish currency unit)
computere 'computers'      filosofi                        rente 'interest'
ibm                        fornuft 'ratio'                 offentlige 'public'
pc                         platon                          økonomiske 'economic'
kan 'can'                  kierkegaards                    bank
mb                         tim (probably a name)           X (figure, number)
apple                      den 'the'                       økonomi 'economy'
amiga                      menneskets 'man's'              vil 'will, shall'
commodore                  filosof 'philosopher'           mia



3.2. Quantitative problems with the approach

Arbitrary significance level (p ≥ 0.99):
• influences the number of types in each domain

Differently sized domain-specific subcorpora:
• cause the corresponding vocabularies to be of different sizes:
  Folklore domain: 1957 types
  Sport domain: 16022 types
  Average for all domains: 7256 types
−→ Take into account the size of these vocabularies



3.3. Qualitative problems with the approach

• Frequent function words appear saliently ranked, e.g.
  den ('the')
  vil ('will')
−→ Not excluded from the vocabularies (fixed expressions)


4. Text classification

4.1. Starting point

• Largest intersection |D ∩ T| between a domain-specific vocabulary D and the set of types T of the text to be classified.
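This starting point amounts to plain set operations. A sketch under assumed names, not the actual implementation:

```python
def classify_naive(text_types, domain_vocabularies):
    """Assign the text to the domain whose vocabulary D has the largest
    intersection |D & T| with the text's set of types T."""
    return max(domain_vocabularies,
               key=lambda name: len(domain_vocabularies[name] & text_types))
```

For example, a text whose types include 'linux' and 'diskette' would be assigned to a computing domain whose vocabulary contains both, rather than to a sport domain sharing no types with it.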



4.2. Problems

• Highly frequent domain-specific types in a text only count 1
  −→ Largest intersection of D and the set of text tokens, W
• Domains with large vocabularies are likely to get a better score
  −→ Size of a domain-specific vocabulary, |D|
• Function words may get too much weight
  −→ Number of domains a token is a member of



4.3. To compute a score for a certain domain. . .

1. count the text tokens that are members of the domain: t ∈ D ∩ W
2. weight the count by the size of the domain-specific vocabulary: v = 1/√|D|
3. weight the score by the number of domains the text token is a member of: w_t = 1/d, where d = ∑_i |{t} ∩ D_i|
4. consider the number of 'unknown' text tokens: u (same as n − k)
5. consider the number of 'known' text tokens: k (same as n − u)
6. consider the text length (to make the score relative): n (same as u + k)

s_D = (1/n) · (k/u) · v · ∑_{t ∈ D ∩ W} w_t
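Putting the six ingredients together, the score s_D can be sketched as follows. This is a minimal illustration; the guard against u = 0 is an added assumption, since the formula as given divides by u.

```python
import math

def domain_score(tokens, D, all_domains):
    """Compute s_D = (1/n) * (k/u) * v * sum(w_t for t in D & W)
    for a text given as a list of tokens W."""
    n = len(tokens)                                            # text length
    k = sum(1 for t in tokens
            if any(t in D_i for D_i in all_domains.values()))  # 'known' tokens
    u = max(n - k, 1)                                          # 'unknown' tokens (guarded)
    v = 1 / math.sqrt(len(D))                                  # vocabulary-size weight
    s = 0.0
    for t in tokens:
        if t in D:
            d = sum(1 for D_i in all_domains.values() if t in D_i)
            s += 1 / d                                         # w_t: uniqueness weight
    return (1 / n) * (k / u) * v * s

def classify(tokens, all_domains):
    """Assign the text to the highest-scoring domain."""
    return max(all_domains,
               key=lambda name: domain_score(tokens, all_domains[name], all_domains))
```

Note that, unlike the naive type intersection, repeated domain-specific tokens now contribute once per occurrence, and tokens shared by many domain vocabularies are down-weighted.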



5. Comparison: determining new words

5.1. Procedure

• Comparison of frequency profiles by log likelihood:
  – new domain-specific material ←→ DDOC
  – salient vocabulary in new material =⇒ candidates for new words
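The candidate step can be sketched as a simple dictionary lookup over the salient types. This is one plausible reading of the procedure, not necessarily DSL's exact criteria; the function and parameter names are assumptions: types with no DDO entry at all become new-entry candidates, while attested types whose recorded DDO domains do not include the text's domain become new-sense candidates.

```python
def split_candidates(salient_types, dictionary_domains, text_domain):
    """Separate candidates for new entries (no DDO entry at all) from
    candidates for new senses (entry exists, but not for this domain)."""
    new_entries, new_senses = [], []
    for t in salient_types:
        domains = dictionary_domains.get(t)
        if domains is None:
            new_entries.append(t)        # missing entry
        elif text_domain not in domains:
            new_senses.append(t)         # attested, but not in this domain
    return new_entries, new_senses
```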


5.2. Example

Du skal bruge en diskette til installationen. På et tidspunkt bliver du spurgt om du vil lave en bootdiskette. Erfaringen siger at det godt kan betale sig at formatere en diskette i forvejen med tjek for dårlige sektorer. Før du installerer Linux, skal der være en partition til rådighed, der er stor nok til at rumme det hele (samt en swap-partition). I løbet af Linux-installationen vil der blive lejlighed til at repartitionere så meget, du har behov for, inden for den plads, der nu er blevet til rådighed.

(English translation:) You will need a diskette for the installation. At some point you will be asked if you want to create a boot diskette. Experience shows that it is worthwhile to format a diskette in advance with a check for bad sectors. Before you install Linux, a partition must be available which is big enough to contain everything (as well as a swap partition). During the Linux installation there will be an opportunity to repartition as much as you need within the space which has now been made available.



5.3. Classification

• The text is classified as a Computing text

5.4. Comparison

• NB: the small size of the text may bias the result
• However, the text is compared with the DDOC . . . and its most salient words are listed as candidates, together with DDO definition domain labels


5.5. New word candidates

Type                                           f_DDOC  f_sample  DDO domains
diskette                                       78      2         Computing
bootdiskette                                   0       1         (missing entry)
formatere 'format'                             0       1         Computing
linux                                          0       1         (missing entry)
linux-installationen 'the Linux installation'  0       1         (missing entry)
⋆partition                                     0       1         (missing entry)
repartitionere 'repartition'                   0       1         (missing entry)
swap-partition                                 0       1         (missing entry)


5.6. New sense candidates

Type                                f_DDOC  f_sample  DDO domains
rådighed 'disposition'              1730    2         General
⋆installerer 'install(s)'           16      1         General, Technology
du 'you'                            143798  5         General
⋆installationen 'the installation'  34      1         Technology, Art, Military
tjek 'check'                        100     1         General
⋆sektorer 'sectors'                 112     1         Society, Politics, Mathematics



6. Discussion, future, and conclusion

Task: Determine new domain-specific words for lexicographic description.

Approach:
1. Corpus =⇒ domain-specific vocabularies
2. Domain-specific vocabularies =⇒ text classification
3. Domain-classified new material ←→ original corpus
4. Salient words =⇒ candidates for new entries/definitions

Each of these steps involves basic decisions that may unintentionally influence the outcome.



6.1. Basic decisions – and questions

• DDOC domain classification
  – High number of domains
  – Greatly varying amount of text under each domain
  −→ Fewer domains, less variance?
• Significance test
  – Log likelihood
  −→ Better-suited tests (e.g. the Mann–Whitney rank test)?
  −→ Do they reflect the nature of the object investigated?



• Classification procedure

  – Reflects properties of the material to be processed

    ∗ Token overlap between a text and a domain vocabulary

    ∗ Size of the domain-specific vocabulary

    ∗ Uniqueness of a certain type for a particular domain

    ∗ Ratio between recognised and unrecognised tokens

  −→ Other properties, e.g. salience rank?

  −→ Consequences of the properties being based on intuition?

  −→ Is the quantification appropriate?

     Yes, it seems to yield acceptable results

     No, it neither explains nor reflects the nature of language

  −→ More appropriate classification approaches?
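The slide lists the properties the score should reflect but not the formula. One possible quantification — entirely illustrative, not the DDO project's actual scoring function — combines all four properties: overlap weighted by type uniqueness, normalised by vocabulary size, scaled by the recognised-token ratio:

```python
from collections import Counter

def domain_scores(tokens, domain_vocab):
    """Score a tokenised text against each domain vocabulary.

    domain_vocab maps a domain name to a set of domain-specific types.
    The score reflects (a) token overlap with the domain vocabulary,
    (b) the vocabulary's size, (c) how unique each type is to the domain,
    and (d) the ratio of recognised to unrecognised tokens."""
    counts = Counter(tokens)
    # uniqueness: a type found in fewer domain vocabularies counts more
    membership = Counter()
    for vocab in domain_vocab.values():
        for t in vocab:
            membership[t] += 1
    scores = {}
    for domain, vocab in domain_vocab.items():
        recognised = sum(n for t, n in counts.items() if t in vocab)
        weighted = sum(n / membership[t] for t, n in counts.items() if t in vocab)
        ratio = recognised / len(tokens) if tokens else 0.0
        # normalise by vocabulary size so large vocabularies gain no head start
        scores[domain] = ratio * weighted / len(vocab) if vocab else 0.0
    return scores
```

On a toy text like `["goal", "goal", "match", "the"]` with hypothetical sport and music vocabularies, the sport domain wins because "goal" is unique to it while "match" is shared.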


• Vocabulary extraction

  – Based on comparing new material with the DDOC

  −→ Compare with a domain-specific DDOC subcorpus instead?
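As a sketch of such a comparison, candidate new words can be flagged where a type is markedly more frequent in the new material than in the reference corpus. The thresholds and the smoothing choice below are invented for illustration, not the project's actual settings:

```python
def new_word_candidates(new_freq, ddoc_freq, new_size, ddoc_size, min_freq=5):
    """Flag candidate new words: types frequent in the new material but
    rare or absent in the reference corpus (illustrative thresholds)."""
    candidates = []
    for t, f in new_freq.items():
        if f < min_freq:  # skip hapax-like noise in the new material
            continue
        ref = ddoc_freq.get(t, 0)
        # relative-frequency ratio, add-one smoothed on the reference side
        ratio = (f / new_size) / ((ref + 1) / ddoc_size)
        if ref == 0 or ratio > 10:
            candidates.append(t)
    return candidates
```

A type absent from the reference always qualifies; an established high-frequency word like "the" does not, because its relative frequencies roughly match.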


6.2. Testing in progress

• The mutual dependencies between these decisions are complex

  −→ Testing with varying parameter settings

• To test the text classification procedure:

  – Divide the DDOC into two parts

  – Same relative amount of text material from each domain

  – Part 1 =⇒ domain-specific vocabularies

  – Part 2 =⇒ test various methodological alternatives
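The split described above — two parts with the same relative amount of material per domain — is a stratified split. A minimal sketch, with a hypothetical dict-of-text-lists corpus representation (the slides do not specify the actual DDOC split procedure):

```python
import random

def stratified_split(texts_by_domain, ratio=0.5, seed=0):
    """Split a corpus into two parts, keeping the same relative share
    of material from each domain in both parts."""
    rng = random.Random(seed)  # seeded for a reproducible split
    part1, part2 = {}, {}
    for domain, texts in texts_by_domain.items():
        shuffled = texts[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * ratio)
        part1[domain] = shuffled[:cut]
        part2[domain] = shuffled[cut:]
    return part1, part2
```

Part 1 then feeds the derivation of domain-specific vocabularies, while part 2 serves as held-out material for testing the methodological alternatives.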


6.3. Future work

• Domain-specific vocabularies develop and change over time

  – Consider newly determined domain words

    ∗ Possible approaches:

      · add the new domain-specific vocabulary to the existing one

      · add new domain-specific material to the DDOC and re-run the procedure

    ∗ Tests must show which approach performs best
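The first approach amounts to a per-domain set union of old and newly determined words. A minimal sketch, assuming vocabularies are held as dicts of type sets (a hypothetical representation, not the project's data model):

```python
def update_vocab(existing, newly_found):
    """Merge newly determined domain words into the existing
    domain vocabularies without mutating the originals."""
    merged = {domain: set(types) for domain, types in existing.items()}
    for domain, words in newly_found.items():
        merged.setdefault(domain, set()).update(words)
    return merged
```

The second approach — enlarging the DDOC and re-running the whole extraction — is costlier but lets the frequency profiling re-weigh every type, which a plain union cannot do.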


6.4. Conclusion

Good:

• The mechanism is applicable to the described task

Bad:

• No explanation of what makes a word or a text domain-specific

• No explanation of what makes a word new

Well . . .

• Even if it is quantitative . . .

  – It is still based on human intuition about language

  – But aren't most quantitative approaches in linguistics?
