
Automatic detection
of new domain-specific words,
using document classification
and frequency profiling

Jørg Asmussen
Department for Digital Dictionaries and Text Corpora
Society for Danish Language and Literature, DSL
ja@dsl.dk

Corpus Linguistics 2005


Contents

1 Background: updating the DDO
  1.1 Source of new vocabulary
  1.2 Updating task
  1.3 Prerequisites for the updating task
2 Data: the DDO Corpus
3 Analysis: deriving domain vocabularies
  3.1 Examples: most salient types for 3 domains
  3.2 Quantitative problems with the approach
  3.3 Qualitative problems with the approach
4 Text classification
  4.1 Starting point
  4.2 Problems
  4.3 To compute a score for a certain domain
5 Comparison: determining new words
  5.1 Procedure
  5.2 Example
  5.3 Classification
  5.4 Comparison
  5.5 New word candidates
  5.6 New sense candidates
6 Discussion, future, and conclusion
  6.1 Basic decisions – and questions
  6.2 Testing in progress
  6.3 Future work
  6.4 Conclusion




1. Background: updating the DDO

• Corpus-based dictionary of modern Danish
• Published 2003–2005 in six volumes
• Enhanced web-based version under preparation: ordnet.dk

Enhanced =⇒ continuously updated with new vocabulary



1.1. Source of new vocabulary

• Mainly newspaper material
  – drawn from the Danish media database, www.infomedia.dk

1.2. Updating task

• Prior to lexicographic description, new vocabulary is subdivided into domains
−→ Assign incoming texts to particular domains
−→ Extract from them words so far not listed in the dictionary
−→ Treat them as candidates for inclusion



1.3. Prerequisites for the updating task

1. Design of a suitable domain classification:
   (a) granularity – how many different domains?
   (b) contents – how to define a certain domain intensionally?
   −→ Decimal classification system DK5 (cf. Dewey's)
2. Classification procedure – how to assign a text to a domain?
   −→ Statistical, based on the DDOC



2. Data: the DDO Corpus

The Corpus of the Danish Dictionary, DDOC:
• 43000 text samples totalling 40 million words
• Compiled by DSL 1991–93
• Broad coverage of Danish language from 1983–1992
• 88.6% of the text samples are assigned to one of 66 domains

From this material, we will
• Extract 66 domain-specific vocabularies
• Use these vocabularies to classify unseen text



3. Analysis: deriving domain vocabularies

1. Creation of domain-specific subcorpora:
   DDOC domain codes =⇒ 66 domain-specific subcorpora
2. Creation of frequency profiles:
   DDOC + subcorpora =⇒ frequency profiles
3. Comparison of frequency profiles:
   • Each of the 66 frequency profiles is compared with the frequency profile of the total DDOC. Significance test: log likelihood.
   • Over-represented types (p ≥ 0.99) within a domain yield a domain-specific vocabulary D.
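Steps 2–3 can be sketched in code. This is a minimal illustration, not DSL's actual implementation: the function names and the dictionary-based frequency profiles are assumptions, and 6.63 is the chi-square critical value (1 degree of freedom) corresponding to the p ≥ 0.99 threshold.

```python
import math

def log_likelihood(f_sub, n_sub, f_ref, n_ref):
    """Dunning's log-likelihood statistic (G2) for one type,
    comparing a domain subcorpus against a reference corpus."""
    # expected frequencies under the null hypothesis of equal relative frequency
    total = n_sub + n_ref
    e_sub = n_sub * (f_sub + f_ref) / total
    e_ref = n_ref * (f_sub + f_ref) / total
    g2 = 0.0
    if f_sub > 0:
        g2 += f_sub * math.log(f_sub / e_sub)
    if f_ref > 0:
        g2 += f_ref * math.log(f_ref / e_ref)
    return 2 * g2

def domain_vocabulary(sub_profile, n_sub, ref_profile, n_ref, critical=6.63):
    """Collect the types over-represented in the subcorpus at p >= 0.99
    (G2 >= 6.63 for 1 degree of freedom): the domain vocabulary D."""
    D = set()
    for t, f in sub_profile.items():
        f_ref = ref_profile.get(t, 0)
        over = f / n_sub > f_ref / n_ref  # over-, not under-, represented
        if over and log_likelihood(f, n_sub, f_ref, n_ref) >= critical:
            D.add(t)
    return D
```

Applied to a toy profile, a domain term with a much higher relative frequency in the subcorpus enters D, while a function word with the same relative frequency in both profiles does not.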


3.1. Examples: most salient types for 3 domains

D_computing                D_philosophy                    D_economy
data                       mennesket 'man'                 kr (Danish currency unit, abbreviated)
programmer 'programs'      kierkegaard                     X,X (amount with decimals)
computer                   moral                           pct
computeren 'the computer'  løgstrup (Danish philosopher)   procent 'percent'
edb 'computing'            aristoteles                     kroner (Danish currency unit)
computere 'computers'      filosofi                        rente 'interest'
ibm                        fornuft 'ratio'                 offentlige 'public'
pc                         platon                          økonomiske 'economic'
kan 'can'                  kierkegaards                    bank
mb                         tim (probably a name)           X (figure, number)
apple                      den 'the'                       økonomi 'economy'
amiga                      menneskets 'man's'              vil 'will, shall'
commodore                  filosof 'philosopher'           mia



3.2. Quantitative problems with the approach

Arbitrary significance level (p ≥ 0.99):
• influences the number of types in each domain

Differently sized domain-specific subcorpora:
• cause the corresponding vocabularies to be of different sizes:
  Folklore domain: 1957 types
  Sport domain: 16022 types
  Average for all domains: 7256 types
−→ Take into account the size of these vocabularies



3.3. Qualitative problems with the approach

• Frequent function words appear saliently ranked, e.g.
  den ('the')
  vil ('will')
−→ Not excluded from the vocabularies (fixed expressions)


4. Text classification

4.1. Starting point

• Largest intersection |D ∩ T| between a domain-specific vocabulary D and the set of types T of the text to be classified.
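This starting point amounts to plain set operations. A sketch under assumed names, not the actual implementation:

```python
def classify_naive(text_types, domain_vocabularies):
    """Assign the text to the domain whose vocabulary D has the largest
    intersection |D & T| with the text's set of types T."""
    return max(domain_vocabularies,
               key=lambda name: len(domain_vocabularies[name] & text_types))
```

For example, a text whose types include 'linux' and 'diskette' would be assigned to a computing domain whose vocabulary contains both, rather than to a sport domain sharing no types with it.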



4.2. Problems

• Highly frequent domain-specific types in a text only count 1
  −→ Largest intersection of D and the set of text tokens, W
• Domains with large vocabularies are likely to get a better score
  −→ Size of a domain-specific vocabulary, |D|
• Function words may get too much weight
  −→ Number of domains a token is a member of



4.3. To compute a score for a certain domain. . .

1. count the text tokens that are members of the domain: t ∈ D ∩ W
2. weight the count by the size of the domain-specific vocabulary: v = 1/√|D|
3. weight the score by the number of domains the text token is a member of: w_t = 1/d, where d = ∑_i |{t} ∩ D_i|
4. consider the number of 'unknown' text tokens: u (same as n − k)
5. consider the number of 'known' text tokens: k (same as n − u)
6. consider the text length (to make the score relative): n (same as u + k)

s_D = (1/n) · (k/u) · v · ∑_{t ∈ D ∩ W} w_t
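Putting the six ingredients together, the score s_D can be sketched as follows. This is a minimal illustration; the guard against u = 0 is an added assumption, since the formula as given divides by u.

```python
import math

def domain_score(tokens, D, all_domains):
    """Compute s_D = (1/n) * (k/u) * v * sum(w_t for t in D & W)
    for a text given as a list of tokens W."""
    n = len(tokens)                                            # text length
    k = sum(1 for t in tokens
            if any(t in D_i for D_i in all_domains.values()))  # 'known' tokens
    u = max(n - k, 1)                                          # 'unknown' tokens (guarded)
    v = 1 / math.sqrt(len(D))                                  # vocabulary-size weight
    s = 0.0
    for t in tokens:
        if t in D:
            d = sum(1 for D_i in all_domains.values() if t in D_i)
            s += 1 / d                                         # w_t: uniqueness weight
    return (1 / n) * (k / u) * v * s

def classify(tokens, all_domains):
    """Assign the text to the highest-scoring domain."""
    return max(all_domains,
               key=lambda name: domain_score(tokens, all_domains[name], all_domains))
```

Note that, unlike the naive type intersection, repeated domain-specific tokens now contribute once per occurrence, and tokens shared by many domain vocabularies are down-weighted.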



5. Comparison: determining new words

5.1. Procedure

• Comparison of frequency profiles by log likelihood:
  – new domain-specific material ←→ DDOC
  – salient vocabulary in new material =⇒ candidates for new words
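The candidate step can be sketched as a simple dictionary lookup over the salient types. This is one plausible reading of the procedure, not necessarily DSL's exact criteria; the function and parameter names are assumptions: types with no DDO entry at all become new-entry candidates, while attested types whose recorded DDO domains do not include the text's domain become new-sense candidates.

```python
def split_candidates(salient_types, dictionary_domains, text_domain):
    """Separate candidates for new entries (no DDO entry at all) from
    candidates for new senses (entry exists, but not for this domain)."""
    new_entries, new_senses = [], []
    for t in salient_types:
        domains = dictionary_domains.get(t)
        if domains is None:
            new_entries.append(t)        # missing entry
        elif text_domain not in domains:
            new_senses.append(t)         # attested, but not in this domain
    return new_entries, new_senses
```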


5.2. Example

Du skal bruge en diskette til installationen. På et tidspunkt bliver du spurgt om du vil lave en bootdiskette. Erfaringen siger at det godt kan betale sig at formatere en diskette i forvejen med tjek for dårlige sektorer. Før du installerer Linux, skal der være en partition til rådighed, der er stor nok til at rumme det hele (samt en swap-partition). I løbet af Linux-installationen vil der blive lejlighed til at repartitionere så meget, du har behov for, inden for den plads, der nu er blevet til rådighed.

(English translation:) You will need a diskette for the installation. At some point you will be asked if you want to create a boot diskette. Experience shows that it is worthwhile to format a diskette in advance with a check for bad sectors. Before you install Linux, a partition must be available which is big enough to contain everything (as well as a swap partition). During the Linux installation there will be an opportunity to repartition as much as you need within the space which has now been made available.



5.3. Classification

• The text is classified as a Computing text

5.4. Comparison

• NB: the small size of the text may bias the result
• However, the text is compared with the DDOC . . . and its most salient words are listed as candidates, together with DDO definition domain labels


5.5. New word candidates

Type                                           f_DDOC  f_sample  DDO domains
diskette                                       78      2         Computing
bootdiskette                                   0       1         (missing entry)
formatere 'format'                             0       1         Computing
linux                                          0       1         (missing entry)
linux-installationen 'the Linux installation'  0       1         (missing entry)
⋆partition                                     0       1         (missing entry)
repartitionere 'repartition'                   0       1         (missing entry)
swap-partition                                 0       1         (missing entry)


5.6. New sense candidates

Type                                f_DDOC  f_sample  DDO domains
rådighed 'disposition'              1730    2         General
⋆installerer 'install(s)'           16      1         General, Technology
du 'you'                            143798  5         General
⋆installationen 'the installation'  34      1         Technology, Art, Military
tjek 'check'                        100     1         General
⋆sektorer 'sectors'                 112     1         Society, Politics, Mathematics



6. Discussion, future, and conclusion

Task: Determine new domain-specific words for lexicographic description.

Approach:
1. Corpus =⇒ domain-specific vocabularies
2. Domain-specific vocabularies =⇒ text classification
3. Domain-classified new material ←→ original corpus
4. Salient words =⇒ candidates for new entries/definitions

Each of these steps involves basic decisions that may unintentionally influence the outcome.



6.1. Basic decisions – and questions

• DDOC domain classification
  – High number of domains
  – Greatly varying amount of text under each domain
  −→ Fewer domains, less variance?
• Significance test
  – Log likelihood
  −→ Better-suited tests (e.g. the Mann–Whitney rank test)?
  −→ Do they reflect the nature of the object investigated?



• Classification procedure

  – Reflects properties of the material to be processed

    ∗ Token overlap between a text and a domain vocabulary

    ∗ Size of the domain-specific vocabulary

    ∗ Uniqueness of a certain type for a particular domain

    ∗ Ratio between recognised and unrecognised tokens

  −→ Other properties, e.g. salience rank?

  −→ Consequences of the properties being based on intuition?

  −→ Is the quantification appropriate?

     Yes, it seems to yield acceptable results

     No, it neither explains nor reflects the nature of language

  −→ More appropriate classification approaches?
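The slide lists the properties the score should reflect but not the formula. One possible quantification — entirely illustrative, not the DDO project's actual scoring function — combines all four properties: overlap weighted by type uniqueness, normalised by vocabulary size, scaled by the recognised-token ratio:

```python
from collections import Counter

def domain_scores(tokens, domain_vocab):
    """Score a tokenised text against each domain vocabulary.

    domain_vocab maps a domain name to a set of domain-specific types.
    The score reflects (a) token overlap with the domain vocabulary,
    (b) the vocabulary's size, (c) how unique each type is to the domain,
    and (d) the ratio of recognised to unrecognised tokens."""
    counts = Counter(tokens)
    # uniqueness: a type found in fewer domain vocabularies counts more
    membership = Counter()
    for vocab in domain_vocab.values():
        for t in vocab:
            membership[t] += 1
    scores = {}
    for domain, vocab in domain_vocab.items():
        recognised = sum(n for t, n in counts.items() if t in vocab)
        weighted = sum(n / membership[t] for t, n in counts.items() if t in vocab)
        ratio = recognised / len(tokens) if tokens else 0.0
        # normalise by vocabulary size so large vocabularies gain no head start
        scores[domain] = ratio * weighted / len(vocab) if vocab else 0.0
    return scores
```

On a toy text like `["goal", "goal", "match", "the"]` with hypothetical sport and music vocabularies, the sport domain wins because "goal" is unique to it while "match" is shared.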


• Vocabulary extraction

  – Based on comparing new material with the DDOC

  −→ Compare with a domain-specific DDOC subcorpus instead?
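As a sketch of such a comparison, candidate new words can be flagged where a type is markedly more frequent in the new material than in the reference corpus. The thresholds and the smoothing choice below are invented for illustration, not the project's actual settings:

```python
def new_word_candidates(new_freq, ddoc_freq, new_size, ddoc_size, min_freq=5):
    """Flag candidate new words: types frequent in the new material but
    rare or absent in the reference corpus (illustrative thresholds)."""
    candidates = []
    for t, f in new_freq.items():
        if f < min_freq:  # skip hapax-like noise in the new material
            continue
        ref = ddoc_freq.get(t, 0)
        # relative-frequency ratio, add-one smoothed on the reference side
        ratio = (f / new_size) / ((ref + 1) / ddoc_size)
        if ref == 0 or ratio > 10:
            candidates.append(t)
    return candidates
```

A type absent from the reference always qualifies; an established high-frequency word like "the" does not, because its relative frequencies roughly match.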


6.2. Testing in progress

• The mutual dependencies between these decisions are complex

  −→ Testing with varying parameter settings

• To test the text classification procedure:

  – Divide the DDOC into two parts

  – Same relative amount of text material from each domain

  – Part 1 =⇒ domain-specific vocabularies

  – Part 2 =⇒ test various methodological alternatives
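The split described above — two parts with the same relative amount of material per domain — is a stratified split. A minimal sketch, with a hypothetical dict-of-text-lists corpus representation (the slides do not specify the actual DDOC split procedure):

```python
import random

def stratified_split(texts_by_domain, ratio=0.5, seed=0):
    """Split a corpus into two parts, keeping the same relative share
    of material from each domain in both parts."""
    rng = random.Random(seed)  # seeded for a reproducible split
    part1, part2 = {}, {}
    for domain, texts in texts_by_domain.items():
        shuffled = texts[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * ratio)
        part1[domain] = shuffled[:cut]
        part2[domain] = shuffled[cut:]
    return part1, part2
```

Part 1 then feeds the derivation of domain-specific vocabularies, while part 2 serves as held-out material for testing the methodological alternatives.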


6.3. Future work

• Domain-specific vocabularies develop and change over time

  – Consider newly determined domain words

    ∗ Possible approaches:

      · add the new domain-specific vocabulary to the existing one

      · add new domain-specific material to the DDOC and re-run the procedure

    ∗ Tests must show which approach performs best
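The first approach amounts to a per-domain set union of old and newly determined words. A minimal sketch, assuming vocabularies are held as dicts of type sets (a hypothetical representation, not the project's data model):

```python
def update_vocab(existing, newly_found):
    """Merge newly determined domain words into the existing
    domain vocabularies without mutating the originals."""
    merged = {domain: set(types) for domain, types in existing.items()}
    for domain, words in newly_found.items():
        merged.setdefault(domain, set()).update(words)
    return merged
```

The second approach — enlarging the DDOC and re-running the whole extraction — is costlier but lets the frequency profiling re-weigh every type, which a plain union cannot do.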


6.4. Conclusion

Good:

• The mechanism is applicable to the described task

Bad:

• No explanation of what makes a word or a text domain-specific

• No explanation of what makes a word new

Well . . .

• Even if it is quantitative . . .

  – It is still based on human intuition about language

  – But aren't most quantitative approaches in linguistics?
