Automatic detection of new domain-specific words, using document ...
Automatic detection of new domain-specific words, using document ...
Automatic detection of new domain-specific words, using document ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Automatic</strong> <strong>detection</strong><br />
<strong>of</strong> <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong>,<br />
<strong>using</strong> <strong>document</strong> classification<br />
and frequency pr<strong>of</strong>iling<br />
—<br />
Jørg Asmussen<br />
Department for Digital Dictionaries and Text Corpora<br />
Society for Danish Language and Literature, DSL<br />
ja@dsl.dk<br />
—<br />
Corpus Linguistics 2005
Contents<br />
1 Background: updating the DDO 4<br />
1.1 Source <strong>of</strong> <strong>new</strong> vocabulary . . . . . . . . . . . . . . . . 5<br />
1.2 Updating task . . . . . . . . . . . . . . . . . . . . . . 5<br />
1.3 Prerequisites for the updating task . . . . . . . . . . . 6<br />
2 Data: the DDO Corpus 7<br />
3 Analysis: deriving <strong>domain</strong> vocabularies 8<br />
3.1 Examples: most salient types for 3 <strong>domain</strong>s . . . . . . 9<br />
3.2 Quantitative problems with the approach . . . . . . . . 10<br />
3.3 Qualitative problems with the approach . . . . . . . . 11<br />
4 Text classification 12<br />
4.1 Starting point . . . . . . . . . . . . . . . . . . . . . . 12<br />
4.2 Problems . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 To compute a score for a certain <strong>domain</strong>. . . . . . . . . 14<br />
5 Comparison: determining <strong>new</strong> <strong>words</strong> 15<br />
5.1 Procedure . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
5.3 Classification . . . . . . . . . . . . . . . . . . . . . . 17<br />
5.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . 17<br />
5.5 New word candidates . . . . . . . . . . . . . . . . . . 18<br />
5.6 New sense candidates . . . . . . . . . . . . . . . . . . 19<br />
6 Discussion, future, and conclusion 20<br />
6.1 Basic decisions – and questions . . . . . . . . . . . . . 21<br />
6.2 Testing in progress . . . . . . . . . . . . . . . . . . . 24<br />
6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . 25<br />
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . 26
1. Background: updating the DDO
1. Background: updating the DDO<br />
• Corpus-based dictionary <strong>of</strong> modern Danish
1. Background: updating the DDO<br />
• Corpus-based dictionary <strong>of</strong> modern Danish<br />
• Published 2003-2005 in six volumes
1. Background: updating the DDO<br />
• Corpus-based dictionary <strong>of</strong> modern Danish<br />
• Published 2003-2005 in six volumes<br />
• Enhanced web-based version under preparation: ordnet.dk
1. Background: updating the DDO<br />
• Corpus-based dictionary <strong>of</strong> modern Danish<br />
• Published 2003-2005 in six volumes<br />
• Enhanced web-based version under preparation: ordnet.dk<br />
Enhanced =⇒ continuously updated with <strong>new</strong> vocabulary
1.1. Source <strong>of</strong> <strong>new</strong> vocabulary<br />
• Mainly <strong>new</strong>spaper material<br />
– drawn from the Danish media database, www.infomedia.dk
1.1. Source <strong>of</strong> <strong>new</strong> vocabulary<br />
• Mainly <strong>new</strong>spaper material<br />
– drawn from the Danish media database, www.infomedia.dk<br />
1.2. Updating task<br />
• Prior to lexicographic description, <strong>new</strong> vocabulary is subdivided<br />
into <strong>domain</strong>s
1.1. Source <strong>of</strong> <strong>new</strong> vocabulary<br />
• Mainly <strong>new</strong>spaper material<br />
– drawn from the Danish media database, www.infomedia.dk<br />
1.2. Updating task<br />
• Prior to lexicographic description, <strong>new</strong> vocabulary is subdivided<br />
into <strong>domain</strong>s<br />
−→ Assign incoming texts to particular <strong>domain</strong>s
1.1. Source <strong>of</strong> <strong>new</strong> vocabulary<br />
• Mainly <strong>new</strong>spaper material<br />
– drawn from the Danish media database, www.infomedia.dk<br />
1.2. Updating task<br />
• Prior to lexicographic description, <strong>new</strong> vocabulary is subdivided<br />
into <strong>domain</strong>s<br />
−→ Assign incoming texts to particular <strong>domain</strong>s<br />
−→ Extract from them <strong>words</strong> so far not listed in the dictionary
1.1. Source <strong>of</strong> <strong>new</strong> vocabulary<br />
• Mainly <strong>new</strong>spaper material<br />
– drawn from the Danish media database, www.infomedia.dk<br />
1.2. Updating task<br />
• Prior to lexicographic description, <strong>new</strong> vocabulary is subdivided<br />
into <strong>domain</strong>s<br />
−→ Assign incoming texts to particular <strong>domain</strong>s<br />
−→ Extract from them <strong>words</strong> so far not listed in the dictionary<br />
−→ Treat them as candidates for inclusion
1.3. Prerequisites for the updating task<br />
1. Design <strong>of</strong> a suitable <strong>domain</strong> classification:
1.3. Prerequisites for the updating task<br />
1. Design <strong>of</strong> a suitable <strong>domain</strong> classification:<br />
(a) granularity – how many different <strong>domain</strong>s?<br />
(b) contents – how to define a certain <strong>domain</strong> intensionally?<br />
−→
1.3. Prerequisites for the updating task<br />
1. Design <strong>of</strong> a suitable <strong>domain</strong> classification:<br />
(a) granularity – how many different <strong>domain</strong>s?<br />
(b) contents – how to define a certain <strong>domain</strong> intensionally?<br />
−→ Decimal classification system DK5 (cf. Dewey’s)
1.3. Prerequisites for the updating task<br />
1. Design <strong>of</strong> a suitable <strong>domain</strong> classification:<br />
(a) granularity – how many different <strong>domain</strong>s?<br />
(b) contents – how to define a certain <strong>domain</strong> intensionally?<br />
−→ Decimal classification system DK5 (cf. Dewey’s)<br />
2. Classification procedure – how to assign a text to a <strong>domain</strong>?<br />
−→
1.3. Prerequisites for the updating task<br />
1. Design <strong>of</strong> a suitable <strong>domain</strong> classification:<br />
(a) granularity – how many different <strong>domain</strong>s?<br />
(b) contents – how to define a certain <strong>domain</strong> intensionally?<br />
−→ Decimal classification system DK5 (cf. Dewey’s)<br />
2. Classification procedure – how to assign a text to a <strong>domain</strong>?<br />
−→ Statistical, based on the DDOC
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong>
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong><br />
• Compiled by DSL 1991-93
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong><br />
• Compiled by DSL 1991-93<br />
• Broad coverage <strong>of</strong> Danish language from 1983-1992
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong><br />
• Compiled by DSL 1991-93<br />
• Broad coverage <strong>of</strong> Danish language from 1983-1992<br />
• 88.6% <strong>of</strong> the text samples are assigned to one <strong>of</strong> 66 <strong>domain</strong>s
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong><br />
• Compiled by DSL 1991-93<br />
• Broad coverage <strong>of</strong> Danish language from 1983-1992<br />
• 88.6% <strong>of</strong> the text samples are assigned to one <strong>of</strong> 66 <strong>domain</strong>s<br />
From this material, we will
2. Data: the DDO Corpus<br />
The Corpus <strong>of</strong> the Danish Dictionary, DDOC:<br />
• 43000 text samples totalling 40 million <strong>words</strong><br />
• Compiled by DSL 1991-93<br />
• Broad coverage <strong>of</strong> Danish language from 1983-1992<br />
• 88.6% <strong>of</strong> the text samples are assigned to one <strong>of</strong> 66 <strong>domain</strong>s<br />
From this material, we will<br />
• Extract 66 <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
• Use these vocabularies to classify unseen text
3. Analysis: deriving <strong>domain</strong> vocabularies<br />
1. Creation <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> subcorpora:<br />
DDOC <strong>domain</strong> codes =⇒ 66 <strong>domain</strong>-<strong>specific</strong> subcorpora
3. Analysis: deriving <strong>domain</strong> vocabularies<br />
1. Creation <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> subcorpora:<br />
DDOC <strong>domain</strong> codes =⇒ 66 <strong>domain</strong>-<strong>specific</strong> subcorpora<br />
2. Creation <strong>of</strong> frequency pr<strong>of</strong>iles:<br />
DDOC + subcopora =⇒ frequency pr<strong>of</strong>iles
3. Analysis: deriving <strong>domain</strong> vocabularies<br />
1. Creation <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> subcorpora:<br />
DDOC <strong>domain</strong> codes =⇒ 66 <strong>domain</strong>-<strong>specific</strong> subcorpora<br />
2. Creation <strong>of</strong> frequency pr<strong>of</strong>iles:<br />
DDOC + subcopora =⇒ frequency pr<strong>of</strong>iles<br />
3. Comparison <strong>of</strong> frequency pr<strong>of</strong>iles:<br />
• Each <strong>of</strong> the 66 frequency pr<strong>of</strong>iles is compared with the frequency<br />
pr<strong>of</strong>ile <strong>of</strong> the total DDOC. Significance test: log<br />
likelihood.<br />
• Over-represented types (p ≥ 0.99) within a <strong>domain</strong> yield a<br />
<strong>domain</strong>-<strong>specific</strong> vocabulary D.
3.1. Examples: most salient types for 3 <strong>domain</strong>s<br />
D computing D philosophy Deconomy<br />
data mennesket ‘man’ kr Danish unit <strong>of</strong> currency (abbreviated)<br />
programmer ‘programs’ kierkegaard X,X amount (with decimals)<br />
computer moral pct<br />
computeren ‘the computer’ løgstrup Danish philosopher procent ‘percent’<br />
edb ‘computing’ aristoteles kroner Danish unit <strong>of</strong> currency<br />
computere ‘computers’ filos<strong>of</strong>i rente ‘interest’<br />
ibm fornuft ‘ratio’ <strong>of</strong>fentlige ‘public’<br />
pc platon økonomiske ‘economic’<br />
kan ‘can’ kierkegaards bank<br />
mb tim probably a name X figure, number<br />
apple den ‘the’ økonomi ‘economy’<br />
amiga menneskets ‘man’s’ vil ‘will, shall’<br />
commodore filos<strong>of</strong> ‘philosopher’ mia
3.2. Quantitative problems with the approach<br />
Arbitrary significance level (p ≥ 0.99):<br />
• influences the number <strong>of</strong> types in each <strong>domain</strong>
3.2. Quantitative problems with the approach<br />
Arbitrary significance level (p ≥ 0.99):<br />
• influences the number <strong>of</strong> types in each <strong>domain</strong><br />
Differently sized <strong>domain</strong>-<strong>specific</strong> subcorpora:<br />
• causes the according vocabularies to be <strong>of</strong> different size:<br />
Folklore <strong>domain</strong>: 1957 types<br />
Sport <strong>domain</strong>: 16022 types<br />
Average for all <strong>domain</strong>s: 7256 types
3.2. Quantitative problems with the approach<br />
Arbitrary significance level (p ≥ 0.99):<br />
• influences the number <strong>of</strong> types in each <strong>domain</strong><br />
Differently sized <strong>domain</strong>-<strong>specific</strong> subcorpora:<br />
• causes the according vocabularies to be <strong>of</strong> different size:<br />
Folklore <strong>domain</strong>: 1957 types<br />
Sport <strong>domain</strong>: 16022 types<br />
Average for all <strong>domain</strong>s: 7256 types<br />
−→ Take into account the size <strong>of</strong> these vocabularies
3.3. Qualitative problems with the approach<br />
2. Frequent function <strong>words</strong> appear saliently ranked, e.g.,<br />
den (‘the’)<br />
vil (‘will’)
3.3. Qualitative problems with the approach<br />
2. Frequent function <strong>words</strong> appear saliently ranked, e.g.,<br />
den (‘the’)<br />
vil (‘will’)<br />
−→ Not excluded from the vocabularies (fixed expressions)
4. Text classification<br />
4.1. Starting point<br />
• Largest intersection |D ∩ T | between a <strong>domain</strong>-<strong>specific</strong> vocabulary<br />
D and the set <strong>of</strong> types T <strong>of</strong> the text to be classified.
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→ Largest intersection <strong>of</strong> D and the set <strong>of</strong> text tokens, W
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→ Largest intersection <strong>of</strong> D and the set <strong>of</strong> text tokens, W<br />
• Domains with large vocabularies are likely to get a better score<br />
−→
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→ Largest intersection <strong>of</strong> D and the set <strong>of</strong> text tokens, W<br />
• Domains with large vocabularies are likely to get a better score<br />
−→ Size <strong>of</strong> a <strong>domain</strong>-<strong>specific</strong> vocabulary, |D|
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→ Largest intersection <strong>of</strong> D and the set <strong>of</strong> text tokens, W<br />
• Domains with large vocabularies are likely to get a better score<br />
−→ Size <strong>of</strong> a <strong>domain</strong>-<strong>specific</strong> vocabulary, |D|<br />
• Function <strong>words</strong> may get too much weight<br />
−→
4.2. Problems<br />
• Highly frequent <strong>domain</strong>-<strong>specific</strong> types in a text only count 1<br />
−→ Largest intersection <strong>of</strong> D and the set <strong>of</strong> text tokens, W<br />
• Domains with large vocabularies are likely to get a better score<br />
−→ Size <strong>of</strong> a <strong>domain</strong>-<strong>specific</strong> vocabulary, |D|<br />
• Function <strong>words</strong> may get too much weight<br />
−→ Number <strong>of</strong> <strong>domain</strong>s a token is a member <strong>of</strong>
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD =
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD = 1<br />
n
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD = 1 k<br />
·<br />
n u
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD = 1 k<br />
· · v<br />
n u
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD = 1<br />
n · k<br />
u · v · ∑<br />
t∈D∩W
4.3. To compute a score for a certain <strong>domain</strong>. . .<br />
1 count text tokens that are a member <strong>of</strong> the <strong>domain</strong> t ∈ D ∩W<br />
2 weigh count by size <strong>of</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary v = 1 √<br />
|D|<br />
3 weigh score by number <strong>of</strong> <strong>domain</strong>s the text token is<br />
a member <strong>of</strong><br />
w = 1 d<br />
where d = ∑i |t ∩ Di|<br />
4 consider number <strong>of</strong> ‘unknown’ text tokens u (same as n − k)<br />
5 consider number <strong>of</strong> ‘known’ text tokens k (same as n − u)<br />
6 consider text length (to make score relative) n (same as u + k)<br />
sD = 1 k<br />
·<br />
n u · v · ∑ wt<br />
t∈D∩W
5. Comparison: determining <strong>new</strong> <strong>words</strong><br />
5.1. Procedure<br />
• Comparison <strong>of</strong> frequency pr<strong>of</strong>iles by log likelihood:
5. Comparison: determining <strong>new</strong> <strong>words</strong><br />
5.1. Procedure<br />
• Comparison <strong>of</strong> frequency pr<strong>of</strong>iles by log likelihood:<br />
– <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> material ←→ DDOC<br />
– salient vocabulary in <strong>new</strong> material =⇒ candidates for <strong>new</strong><br />
<strong>words</strong>
5.2. Example<br />
Du skal bruge en diskette til installationen. På et<br />
tidspunkt bliver du spurgt om du vil lave en<br />
bootdiskette. Erfaringen siger at det godt kan<br />
betale sig at formatere en diskette i forvejen med<br />
tjek for dårlige sektorer. Før du installerer Linux,<br />
skal der være en partition til rådighed, der er stor<br />
nok til at rumme det hele (samt en swap-partition).<br />
I løbet af Linux-installationen vil der blive lejlighed<br />
til at repartitionere så meget, du har behov for,<br />
inden for den plads, der nu er blevet til rådighed.<br />
You will need a diskette for the installation. At a<br />
point you will be asked if you want to create a boot<br />
diskette. Experience shows that it is worthwhile to<br />
format a diskette in advance with a check for bad<br />
sectors. Before you install Linux, there must be<br />
allocated a partition which is big enough to contain<br />
everything (as well as a swap partition). During the<br />
Linux installation there will be an opportunity to<br />
repartition as much as you need within the space<br />
which now has been made available.
5.3. Classification<br />
• The text is classified as a Computing text
5.3. Classification<br />
• The text is classified as a Computing text<br />
5.4. Comparison<br />
• OBS! The small size <strong>of</strong> the text may bias the result
5.3. Classification<br />
• The text is classified as a Computing text<br />
5.4. Comparison<br />
• OBS! The small size <strong>of</strong> the text may bias the result<br />
• However, the text is compared with the DDOC . . .
5.3. Classification<br />
• The text is classified as a Computing text<br />
5.4. Comparison<br />
• OBS! The small size <strong>of</strong> the text may bias the result<br />
• However, the text is compared with the DDOC . . .<br />
and its most salient <strong>words</strong> are listed as candidates
5.3. Classification<br />
• The text is classified as a Computing text<br />
5.4. Comparison<br />
• OBS! The small size <strong>of</strong> the text may bias the result<br />
• However, the text is compared with the DDOC . . .<br />
and its most salient <strong>words</strong> are listed as candidates<br />
together with DDO definition <strong>domain</strong> labels
5.5. New word candidates<br />
Type fDDOC fsample DDO <strong>domain</strong>s<br />
diskette 78 2 Computing<br />
bootdiskette 0 1 missing entry<br />
formatere ‘format’ 0 1 Computing<br />
linux 0 1 missing entry<br />
linux-installationen ‘the linux installation’ 0 1 missing entry<br />
⋆partition 0 1 missing entry<br />
repartitionere ‘repartition’ 0 1 missing entry<br />
swap-partition 0 1 missing entry
5.6. New sense candidates<br />
Type fDDOC fsample DDO <strong>domain</strong>s<br />
rådighed ‘disposition’ 1730 2 General<br />
⋆installerer ‘install(s)’ 16 1 General<br />
Technology<br />
du ‘you’ 143798 5 General<br />
⋆installationen ‘the installation’ 34 1 Technology<br />
Art<br />
Military<br />
tjek ‘check’ 100 1 General<br />
⋆sektorer ‘sectors’ 112 1 Society<br />
Politics<br />
Mathematics
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.<br />
Approach:<br />
1. Corpus =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.<br />
Approach:<br />
1. Corpus =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
2. Domain-<strong>specific</strong> vocabularies =⇒ text classification
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.<br />
Approach:<br />
1. Corpus =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
2. Domain-<strong>specific</strong> vocabularies =⇒ text classification<br />
3. Domain-classified <strong>new</strong> material ←→ original corpus
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.<br />
Approach:<br />
1. Corpus =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
2. Domain-<strong>specific</strong> vocabularies =⇒ text classification<br />
3. Domain-classified <strong>new</strong> material ←→ original corpus<br />
4. Salient <strong>words</strong> =⇒ candidates for <strong>new</strong> entries/definitions
6. Discussion, future, and conclusion<br />
Task: Determine <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> <strong>words</strong> for lexicographic de-<br />
scription.<br />
Approach:<br />
1. Corpus =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
2. Domain-<strong>specific</strong> vocabularies =⇒ text classification<br />
3. Domain-classified <strong>new</strong> material ←→ original corpus<br />
4. Salient <strong>words</strong> =⇒ candidates for <strong>new</strong> entries/definitions<br />
Each <strong>of</strong> these steps involves some basic decisions which unintentionally<br />
may influence the outcome.
6.1. Basic decisions – and questions<br />
• DDOC <strong>domain</strong> classification<br />
– High number <strong>of</strong> <strong>domain</strong>s<br />
– Greatly varying amount <strong>of</strong> text under each <strong>domain</strong>
6.1. Basic decisions – and questions<br />
• DDOC <strong>domain</strong> classification<br />
– High number <strong>of</strong> <strong>domain</strong>s<br />
– Greatly varying amount <strong>of</strong> text under each <strong>domain</strong><br />
−→ Fewer <strong>domain</strong>s, less variance?<br />
• Significance test<br />
– Log likelihood
6.1. Basic decisions – and questions<br />
• DDOC <strong>domain</strong> classification<br />
– High number <strong>of</strong> <strong>domain</strong>s<br />
– Greatly varying amount <strong>of</strong> text under each <strong>domain</strong><br />
−→ Fewer <strong>domain</strong>s, less variance?<br />
• Significance test<br />
– Log likelihood<br />
−→ Better suited tests (e.g. Mann-Whitney ranks test)?
6.1. Basic decisions – and questions<br />
• DDOC <strong>domain</strong> classification<br />
– High number <strong>of</strong> <strong>domain</strong>s<br />
– Greatly varying amount <strong>of</strong> text under each <strong>domain</strong><br />
−→ Fewer <strong>domain</strong>s, less variance?<br />
• Significance test<br />
– Log likelihood<br />
−→ Better suited tests (e.g. Mann-Whitney ranks test)?<br />
−→ Are they reflecting the nature <strong>of</strong> the object investigated?
• Classification procedure<br />
– Reflect properties <strong>of</strong> the material to be processed
• Classification procedure<br />
– Reflect properties <strong>of</strong> the material to be processed<br />
∗ Token overlap between a text and a <strong>domain</strong> vocabulary<br />
∗ Size <strong>of</strong> the <strong>domain</strong>-<strong>specific</strong> vocabulary<br />
∗ Uniqueness <strong>of</strong> a certain type for a particular <strong>domain</strong><br />
∗ Ratio between recognised and unrecognised tokens
• Classification procedure<br />
– Reflect properties <strong>of</strong> the material to be processed<br />
∗ Token overlap between a text and a <strong>domain</strong> vocabulary<br />
∗ Size <strong>of</strong> the <strong>domain</strong>-<strong>specific</strong> vocabulary<br />
∗ Uniqueness <strong>of</strong> a certain type for a particular <strong>domain</strong><br />
∗ Ratio between recognised and unrecognised tokens<br />
−→ Other properties, e.g. salience rank?<br />
−→ Consequences <strong>of</strong> the properties being based on intuition?<br />
−→ Is the quantification appropriate?<br />
Yes, it seems to yield acceptable results<br />
No, it doesn’t explain nor reflect the nature <strong>of</strong> language
• Classification procedure<br />
– Reflect properties <strong>of</strong> the material to be processed<br />
∗ Token overlap between a text and a <strong>domain</strong> vocabulary<br />
∗ Size <strong>of</strong> the <strong>domain</strong>-<strong>specific</strong> vocabulary<br />
∗ Uniqueness <strong>of</strong> a certain type for a particular <strong>domain</strong><br />
∗ Ratio between recognised and unrecognised tokens<br />
−→ Other properties, e.g. salience rank?<br />
−→ Consequences <strong>of</strong> the properties being based on intuition?<br />
−→ Is the quantification appropriate?<br />
Yes, it seems to yield acceptable results<br />
No, it doesn’t explain nor reflect the nature <strong>of</strong> language<br />
−→ More appropriate classification approaches?
• Vocabulary extraction<br />
– Based on comparison <strong>of</strong> <strong>new</strong> material with the DDOC
• Vocabulary extraction<br />
– Based on comparison <strong>of</strong> <strong>new</strong> material with the DDOC<br />
−→ Compare with <strong>domain</strong>-<strong>specific</strong> DDOC subcorpus instead?
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters<br />
• To test the text classification procedure:<br />
–
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters<br />
• To test the text classification procedure:<br />
– Divide the DDOC into two parts<br />
–
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters<br />
• To test the text classification procedure:<br />
– Divide the DDOC into two parts<br />
– Same relative amount <strong>of</strong> text material from each <strong>domain</strong><br />
–
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters<br />
• To test the text classification procedure:<br />
– Divide the DDOC into two parts<br />
– Same relative amount <strong>of</strong> text material from each <strong>domain</strong><br />
– Part 1 =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
–
6.2. Testing in progress<br />
• Mutual dependencies between these decisions are complex<br />
−→ Testing <strong>of</strong> various alternating parameters<br />
• To test the text classification procedure:<br />
– Divide the DDOC into two parts<br />
– Same relative amount <strong>of</strong> text material from each <strong>domain</strong><br />
– Part 1 =⇒ <strong>domain</strong>-<strong>specific</strong> vocabularies<br />
– Part 2 =⇒ test various methodological alternatives
6.3. Future work<br />
• Domain-<strong>specific</strong> vocabularies develop and change over time
6.3. Future work<br />
• Domain-<strong>specific</strong> vocabularies develop and change over time<br />
– Consider <strong>new</strong>ly determined <strong>domain</strong> <strong>words</strong><br />
∗
6.3. Future work<br />
• Domain-<strong>specific</strong> vocabularies develop and change over time<br />
– Consider <strong>new</strong>ly determined <strong>domain</strong> <strong>words</strong><br />
∗ Possible approaches:<br />
∗<br />
· add <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary to the already<br />
existing<br />
· add <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> material to the DDOC<br />
and re-run procedure
6.3. Future work<br />
• Domain-<strong>specific</strong> vocabularies develop and change over time<br />
– Consider <strong>new</strong>ly determined <strong>domain</strong> <strong>words</strong><br />
∗ Possible approaches:<br />
· add <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> vocabulary to the already<br />
existing<br />
· add <strong>new</strong> <strong>domain</strong>-<strong>specific</strong> material to the DDOC<br />
and re-run procedure<br />
∗ Tests must show which approach performs best
6.4. Conclusion<br />
Good:<br />
• The mechanism is applicable for the described task
6.4. Conclusion<br />
Good:<br />
• The mechanism is applicable for the described task<br />
Bad:<br />
• No explanation <strong>of</strong> what makes a word nor a text <strong>domain</strong>-<strong>specific</strong><br />
• No explanation <strong>of</strong> what makes a word <strong>new</strong>
6.4. Conclusion<br />
Good:<br />
• The mechanism is applicable for the described task<br />
Bad:<br />
• No explanation <strong>of</strong> what makes a word nor a text <strong>domain</strong>-<strong>specific</strong><br />
• No explanation <strong>of</strong> what makes a word <strong>new</strong><br />
Well . . .<br />
• Even if it is quantitative. . .
6.4. Conclusion<br />
Good:<br />
• The mechanism is applicable for the described task<br />
Bad:<br />
• No explanation <strong>of</strong> what makes a word nor a text <strong>domain</strong>-<strong>specific</strong><br />
• No explanation <strong>of</strong> what makes a word <strong>new</strong><br />
Well . . .<br />
• Even if it is quantitative. . .<br />
– It is still based on human intuition about language
6.4. Conclusion<br />
Good:<br />
• The mechanism is applicable for the described task<br />
Bad:<br />
• No explanation <strong>of</strong> what makes a word nor a text <strong>domain</strong>-<strong>specific</strong><br />
• No explanation <strong>of</strong> what makes a word <strong>new</strong><br />
Well . . .<br />
• Even if it is quantitative. . .<br />
– It is still based on human intuition about language<br />
– But are not most quantitative approaches in linguistics?