04.05.2013 Views


EE3J2 Data Mining<br />

Lecture 3<br />

Zipf’s Law<br />

Stemming & Stop Lists<br />

Martin Russell<br />

EE3J2 Data Mining 2008 – lecture 3<br />

Slide 1


Objectives<br />

Understand Zipf’s law<br />

Understand utility of Zipf’s law for IR<br />

Understand motivation and methods of Stemming<br />

Understand definition and use of Stop Lists<br />



Zipf’s Law<br />

George Kingsley Zipf (1902-1950)<br />

– For each word w, let F(w) be the number of times w occurs in the corpus<br />

– Sort the words according to frequency<br />

– The words’ rank-frequency distribution will be fitted closely by the function:<br />

$$F(r) = \frac{C}{r^{\alpha}}, \quad \text{where } \alpha \approx 1,\ C \approx 0.1$$
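As a rough illustration of the steps above (count, sort, compare with the curve), here is a minimal Python sketch. The tokeniser, the function names and the toy sentence are assumptions made for illustration, not part of the lecture. Scaling C by the total number of words assumes that C ≈ 0.1 refers to relative frequency.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return [(rank, word, count)], with rank 1 for the most frequent word."""
    # Assumption: a 'word' is a run of letters or apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count; ties are broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def zipf_curve(rank, total_words, C=0.1, alpha=1.0):
    """Predicted count at a given rank: total_words * C / rank**alpha."""
    return total_words * C / rank ** alpha

if __name__ == "__main__":
    sample = "the cat and the dog and the bird"  # toy text, too small for a real fit
    table = rank_frequency(sample)
    total = sum(count for _, _, count in table)
    for rank, word, count in table:
        print(rank, word, count, round(zipf_curve(rank, total), 2))
```

On a real corpus the observed counts and the curve would normally be compared on a plot rather than printed.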



Word Frequency Plot: “Alice in Wonderland”<br />

[Figure: Zipf’s law curve plotted against actual statistics from “Alice in Wonderland”. x-axis: word (ranked by frequency), with sample labels from most frequent: the, of, was, all, be, they, up, know, would, or, time, king, an, turtle, think, go, which. y-axis: frequency, 0 to 3000.]

Different words 2,787, Total words 26,395<br />



Explanations of Zipf’s Law (1)<br />

Linguistic / psycholinguistic explanation<br />

Mathematical explanation<br />



Principle of Least Effort<br />

Basically, when a person applies a tool to a job, he or she tries to minimise the effort to achieve an acceptable goal<br />

Example – speech communication<br />

– Goal is to exchange information<br />

– Speaker would prefer minimum effort (minimum articulation, vocabulary etc)<br />

– Listener would prefer careful articulation, detailed & specific vocabulary<br />

– Feedback from listener indicates success or failure of communication<br />



Speech Communication<br />

So, if the talker and listener are familiar friends from the same linguistic background, minimal articulation is used<br />

But if the talker and listener are, say, both non-native users of the language then careful articulation and choice of vocabulary is important<br />



‘Text communication’<br />

For an author of text there is typically no immediate feedback<br />

The ‘Principle of Least Effort’ suggests that the author will use a basic ‘working’ vocabulary where possible, with uncommon, special words for particular tasks<br />

This appears to be consistent with Zipf’s law<br />



Mathematical explanation<br />

Monkey Text<br />

– Imagine a typewriter with just 5 keys: a, b, c, d and space<br />

– Suppose that a monkey sits at the typewriter and presses each key with equal probability p:<br />

p = p(a) = p(b) = p(c) = p(d) = p(space) = 1/5<br />

– As before, we’ll say that a word is a sequence of characters bordered by spaces<br />



‘Monkey text’ continued<br />

Probability of a particular 1-character ‘word’ x is:<br />

p(x) × p(space) = 1/25<br />

There are 4 1-character ‘words’<br />

Probability of a particular 2-character ‘word’ xy is:<br />

p(x) × p(y) × p(space) = 1/125<br />

There are 4 × 4 = 16 2-character ‘words’<br />

etc<br />
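The counting above can be checked with a short Python sketch. It assumes the 5-key typewriter described earlier (a, b, c, d and space) and uses the unnormalised probabilities exactly as they are written here, i.e. one factor of p per letter plus one for the closing space. The function name is illustrative.

```python
from itertools import product

def monkey_word_probabilities(letters="abcd", max_len=3):
    """Probability of each 'word' up to max_len, computed as p**(length + 1)."""
    p = 1.0 / (len(letters) + 1)  # every key, including space, is equally likely
    probs = {}
    for length in range(1, max_len + 1):
        for chars in product(letters, repeat=length):
            # 'length' letters followed by one terminating space
            probs["".join(chars)] = p ** (length + 1)
    return probs

if __name__ == "__main__":
    probs = monkey_word_probabilities()
    # Rank words by descending probability (ties broken alphabetically).
    ranked = sorted(probs.items(), key=lambda item: (-item[1], item[0]))
    for rank, (word, prob) in enumerate(ranked[:8], start=1):
        print(rank, word, prob)
```

Ranking the words by probability gives a stepped curve: all words of the same length share one probability, and each extra letter divides it by 5.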



Graphical ‘tree’ representation<br />

[Figure: tree diagram with levels 0 to 3. Each node branches on the keys A, B, …, Z and space; nodes are labelled with word probabilities p(A), p(AA), p(AZ) and p(AAA).]



Zipf’s Law and ‘Monkey Text’<br />

[Figure: probability (0 to 0.045) against word rank (1 to 97), words ranked by probability.]



Observations<br />

This analysis isn’t quite right (although the basic conclusion is correct)!<br />

Zipf’s Law applies to the probability distribution of words in a text<br />

The probabilities that we have calculated only form a distribution (i.e. sum to one) if all sequences are considered<br />

But we are only interested in sequences that begin and end with a space<br />

Therefore we need to normalise the subset of probabilities corresponding to the sequences of interest<br />

This is dealt with in Belew, pages 150-152 (but beware, there are some errors!)<br />



More formal analysis<br />

Suppose the alphabet has M characters, plus a space character<br />

p = p(A) = … = p(Z) = 1/(M+1)<br />

So, the probability of a particular ‘word’ $w_k$ of length k is (remember the spaces before and after the word)<br />

$$p(w_k) = p^{k+2}$$



Calculating word probability<br />

Given an infinitely long text, the probability $p_k$ that $w_k$ occurs in the text must be proportional to $p(w_k)$:<br />

$$p_k = c\,p(w_k) = \frac{c}{(M+1)^{k+2}}$$

The number of words of length k is $M^k$. So the probability of any word of length k occurring is $M^k p_k$<br />

It must be the case that the sum of these probabilities over all k is 1, and we can use this to find c:<br />

$$1 = \sum_{k=1}^{\infty} M^k p_k = \sum_{k=1}^{\infty} \frac{c\,M^k}{(M+1)^{k+2}} \;\Rightarrow\; c = \frac{(M+1)^2}{M} \;\Rightarrow\; p_k = \frac{1}{M\,(M+1)^k}$$
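As a numerical sanity check on the normalisation, the sketch below sums $M^k p_k$ over word lengths using the closed form c = (M+1)²/M and watches the total approach 1. The function names are illustrative, and the sum is truncated at a finite length.

```python
def normalising_constant(M):
    """Closed form c = (M + 1)**2 / M obtained from the sum over word lengths."""
    return (M + 1) ** 2 / M

def total_probability(M, max_len):
    """Partial sum of M**k * p_k for k = 1..max_len, p_k = c / (M + 1)**(k + 2)."""
    c = normalising_constant(M)
    ratio = M / (M + 1)  # keep the geometric ratio as a float to avoid huge integers
    return sum(ratio ** k * c / (M + 1) ** 2 for k in range(1, max_len + 1))

if __name__ == "__main__":
    for max_len in (1, 10, 100, 400):
        print(max_len, total_probability(26, max_len))
```

The partial sums converge slowly for M = 26 because the ratio M/(M+1) is close to 1, so long ‘words’ carry a noticeable share of the total probability.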



Calculating word rank<br />

Number of words of length 1 is $M$<br />

Number of words of length 2 is $M^2$<br />

In general, number of words of length k is $M^k$<br />

Therefore, the number of words of length k or less is:<br />

$$N_k = \sum_{i=1}^{k} M^i = \frac{M\,(1 - M^k)}{1 - M}$$

Our word $w_k$ occurs less frequently than shorter words and more frequently than longer words. In other words, if $r_k$ is the rank of $w_k$:<br />

$$N_{k-1} < r_k \le N_k$$



Calculating word rank (continued)<br />

Therefore, on average:<br />

$$r_k = \frac{N_{k-1} + N_k + 1}{2} = \frac{(M+1)\,(M^k - 1)}{2\,(M-1)}$$

We can now derive two expressions for k, one in terms of $p_k$ and one in terms of $r_k$.<br />
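A sketch of the two expressions, worked from the normalised probability $p_k = 1/(M\,(M+1)^k)$ and the average rank $r_k = (M+1)(M^k-1)/(2(M-1))$. This is my own working, not copied from the slides, and should be checked against Belew, pages 150-152.

```latex
% From the probability: solve p_k = 1 / (M (M+1)^k) for k
k = -\frac{\log\left(M\,p_k\right)}{\log(M+1)}

% From the rank: solve r_k = (M+1)(M^k - 1) / (2(M-1)) for k
k = \frac{\log\left(1 + \dfrac{2\,(M-1)}{M+1}\,r_k\right)}{\log M}
```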



Relating probability and rank<br />

By setting these equal we can get an expression for $p_k$ in terms of $r_k$, which is what you need to compare with the Zipf curve:<br />

$$p_k = \frac{C}{(r_k + B)^{\alpha}}$$

This is of the same form as Zipf. The values of B and C depend on M and α<br />

The value of α depends on M (see Belew, pages 150-152 for details)<br />

For M=26, α = 1.012, C = 0.02, B = 0.54<br />
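The quoted values for M = 26 can be reproduced numerically. The closed forms used below, α = ln(M+1)/ln M, B = (M+1)/(2(M−1)) and C = B^α/M, are my own working from $p_k = 1/(M\,(M+1)^k)$ and $r_k = (M+1)(M^k-1)/(2(M-1))$. They are not stated on the slides, so treat them as assumptions that happen to agree with the figures above.

```python
import math

def zipf_parameters(M):
    """alpha, B, C in p_k = C / (r_k + B)**alpha (closed forms are assumptions)."""
    alpha = math.log(M + 1) / math.log(M)
    B = (M + 1) / (2 * (M - 1))
    C = B ** alpha / M
    return alpha, B, C

def p_k(M, k):
    """Normalised probability of one particular word of length k."""
    return 1.0 / (M * (M + 1) ** k)

def r_k(M, k):
    """Average rank of the words of length k."""
    return (M + 1) * (M ** k - 1) / (2 * (M - 1))

if __name__ == "__main__":
    alpha, B, C = zipf_parameters(26)
    print(round(alpha, 3), round(B, 2), round(C, 2))
    for k in range(1, 4):
        # The two columns should agree if the closed forms are right.
        print(k, p_k(26, k), C / (r_k(26, k) + B) ** alpha)
```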



Conclusions (Zipf’s Law)<br />

Zipf’s law appears to reflect simple character statistics, rather than meaning<br />

Of limited direct relevance for IR<br />

Potentially useful for identifying keywords<br />



‘Resolving Power’ of words<br />

[Figure: Zipf’s Law curve of frequency (0 to 3000) against word rank (1 to 106, ranked by frequency), overlaid with the ‘resolving power’ of words. An upper cutoff and a lower cutoff are marked: words more frequent than the upper cutoff are too common, and words less frequent than the lower cutoff are too rare.]
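One way to read the two cutoffs is as a band of ranks from which candidate keywords are taken. The Python sketch below does this, assuming the cutoffs are given as ranks. The function name and the cutoff values are illustrative choices, not values from the lecture.

```python
from collections import Counter

def candidate_keywords(words, upper_cutoff, lower_cutoff):
    """Keep words whose frequency rank lies between the two cutoffs.

    Ranks start at 1 for the most frequent word. Words ranked at or before
    upper_cutoff are treated as too common, and words ranked after
    lower_cutoff as too rare. Both cutoffs are chosen by the caller.
    """
    counts = Counter(words)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[upper_cutoff:lower_cutoff]

if __name__ == "__main__":
    words = "the the the the cat cat cat dog dog bird".split()
    print(candidate_keywords(words, upper_cutoff=1, lower_cutoff=3))
```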

Stemming (morphology)<br />

Remove surface markings from words to reveal their basic form:<br />

– forms → form<br />

– forming → form<br />

– formed → form<br />

– former → form<br />

If a query and document contain different forms of the same word, we want to know this<br />

Stemming<br />

Of course, not all words obey such simple rules:<br />

– running → run<br />

– runs → run<br />

– women → woman<br />

– leaves → leaf<br />

– ferries → ferry<br />

– alumnus → alumni<br />

– datum → data<br />

– crisis → crises<br />

[Belew, chapter 2]<br />

Stemming<br />

Linguists distinguish between different types of morphology:<br />

– Minor changes, such as plurals, tense<br />

– Major changes, e.g. incentive → incentivize, which change the grammatical category of a word<br />

Common solution is to identify sub-patterns of letters within words and devise rules for dealing with these patterns<br />

Stemming<br />

Example rules [Belew, p 45]<br />

– (.*)SSES → /1SS<br />

– Any string ending SSES is stemmed by replacing SSES with SS<br />

– (.*[AEIOU].*)ED → /1<br />

– Any string containing a vowel and ending in ED is stemmed by removing the ED<br />
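A minimal Python sketch of the two example rules, written with the standard `re` module. It reads the second rule as its description states (the part before ED must contain a vowel). The upper-casing and the first-match-wins policy are assumptions of this sketch, not something stated in the lecture.

```python
import re

# Each rule is (pattern, replacement); \1 refers to the bracketed group.
RULES = [
    (re.compile(r"^(.*)SSES$"), r"\1SS"),       # ...SSES -> ...SS
    (re.compile(r"^(.*[AEIOU].*)ED$"), r"\1"),  # vowel somewhere before ED -> drop ED
]

def apply_rules(word):
    """Apply the first rule that matches; return the word unchanged otherwise."""
    word = word.upper()
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word

if __name__ == "__main__":
    for example in ("caresses", "formed", "bled"):
        print(example, "->", apply_rules(example))
```

The vowel condition is what stops a short word such as ‘bled’ from being cut down to ‘bl’.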

Stemmers<br />

A stemmer is a piece of software which implements a stemming algorithm<br />

The Porter stemmer is a standard stemmer which is available as a free download (see EE3J2 webpage for the URL)<br />

The Porter stemmer implements a set of about 60 rules<br />

Use of a stemmer typically reduces vocabulary size by 10% to 50%<br />



Example<br />

Apply the Porter stemmer to the ‘Jane Eyre’ and ‘Alice in Wonderland’ texts<br />

[Figure: bar chart of the number of different words (0 to 16000) before and after stemming, for the texts Alice and Jane Eyre. The bars are annotated with reductions of 34% and 22%.]



Example<br />

Examples of results of Porter stemmer:<br />

– form → form<br />

– former → former<br />

– formed → form<br />

– forming → form<br />

– formal → formal<br />

– formality → formal<br />

– formalism → formal<br />

– formica → formica<br />

– formic → formic<br />

– formant → formant<br />

– format → format<br />

– formation → format<br />



Example: First paragraph from ‘Alice in Wonderland’<br />

Before<br />

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’<br />

After<br />

alic wa begin to get veri tire of sit by her sister on the bank, and of have noth to do: onc or twice she had peep into the book her sister wa read, but it had no pictur or convers in it, ‘and what is the us of a book,’ thought alic ‘without pictur or convers?’<br />



Noise Words<br />

There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further outdoor exercise was now out of the question<br />

Noise words<br />

– Vital to understand the grammatical structure of a text<br />

– Of little use in the ‘bundle of words’ approach<br />

Stop Lists<br />

In Information Retrieval, these words are often referred to as Stop Words<br />

Rather than detecting stop words using rules, or some other form of analysis, the stop words are simply specified to the system in a list: the Stop List<br />

Stop Lists typically consist of the most common words from some large corpus<br />

There are lots of candidate stop lists online<br />
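Since a stop list is typically just the most common words of a large corpus, one can be built directly from word counts. A minimal Python sketch, in which the function name and the lower-casing are assumptions:

```python
from collections import Counter

def build_stop_list(words, size):
    """Return the `size` most common words, most frequent first."""
    counts = Counter(w.lower() for w in words)
    # Ties are broken alphabetically so that the result is reproducible.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]

if __name__ == "__main__":
    corpus = "The cat and the dog and the bird".split()
    print(build_stop_list(corpus, 2))
```

A list built from a single book will pick up that book’s own frequent words, so a large general corpus is the safer source.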



Example 1: Short Stop List (50 wds)<br />

the, of, and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, i, this, had, not, are, but, from, or, have, an, they, which, you, were, her, all, she, there, would, their, we, him, been, has, when, who, will, more, if, out, so<br />



Example 2: 300 Word Stop List<br />

the, of, and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, i, this, had, not, are, but, from, or, have, an, they, which, you, were, her, all, she, there, would, their, we, him, been, has, when, who, will, more, no, if, out, so, said, what, up, its, about, into, than, them, can, only, other, one, ………., held, keep, sure, probably, free, real, seems, behind, cannot, miss, political, air, question, making, office, brought, whose, special, heard, major, problems, ago, became, federal, moment, study, available, known, result, street, economic, boy<br />

300 most common words from Brown Corpus<br />


Alice vs Brown: Most Frequent Words<br />

Alice: the, and, to, a, she, it, of, said, i, alice, in, you, was, that, as, her, at, on, all, with, had, but, for, so, be, not, very, what, this, they, little, he, out, is, one, down, up, his, if, about, then, no, know, them, like, were, again, herself, went, would, do, have, when, could, or, there, thought, off, how, me<br />

Brown: the, of, and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, i, this, had, not, are, but, from, or, have, an, they, which, you, were, her, all, she, there, would, their, we, him, been, has, when, who, will, more, if, out, so<br />


stop.c<br />

C program on course website<br />

– Reads in a stop list file (text file, one word per line)<br />

– Stores stop words in char **stopList<br />

– Reads the text file one word at a time<br />

– Compares each word with each stop word<br />

– Prints out words not in stop list<br />

stop stopListFile textFile > opFile<br />
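stop.c itself is not reproduced in these slides. The following is a rough Python sketch of the same steps, not the course program: the function names and the tokenisation are assumptions, and a set lookup stands in for comparing each word with every entry of `char **stopList`.

```python
import re
import sys

def load_stop_list(path):
    """Read a stop list file containing one word per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in the stop list, in order."""
    # Assumption: words are runs of letters or apostrophes, compared in lower case.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in stop_words]

if __name__ == "__main__" and len(sys.argv) == 3:
    # Mirrors the usage line above: stop stopListFile textFile > opFile
    stop_words = load_stop_list(sys.argv[1])
    with open(sys.argv[2], encoding="utf-8") as handle:
        print(" ".join(remove_stop_words(handle.read(), stop_words)))
```

A set lookup takes roughly constant time per word, whereas comparing each word with each stop word grows with the length of the stop list. For a 50-word list the difference is negligible.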

Examples<br />

Original first paragraph<br />

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’<br />

Stop list 50 removed<br />

alice beginning get very tired sitting sister bank having nothing do once twice peeped into book sister reading no pictures conversations what use book thought alice without pictures conversation<br />

Stop list Brown removed<br />

alice beginning tired sitting sister bank twice peeped book sister reading pictures conversations book alice pictures conversation<br />



Summary<br />

Document → Stemming → Stop Word Removal → Match<br />

Text query → Stemming → Stop Word Removal → Match<br />
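The pipeline can be sketched in a few lines of Python. The stemmer below is a toy placeholder that strips a handful of suffixes. It is not the Porter stemmer, and all names are illustrative. Because stop word removal runs after stemming here, the sketch also stems the stop list so that the two can be compared like for like.

```python
def toy_stem(word):
    """Placeholder stemmer: strips a few suffixes. It is NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(words, stop_words, stem=toy_stem):
    """Stem every word, then drop those whose stem is on the stemmed stop list."""
    stemmed_stops = {stem(w) for w in stop_words}
    return [s for s in (stem(w.lower()) for w in words) if s not in stemmed_stops]

def match(document_words, query_words, stop_words):
    """Return the stems shared by the document and the query."""
    doc = set(preprocess(document_words, stop_words))
    query = set(preprocess(query_words, stop_words))
    return doc & query

if __name__ == "__main__":
    stops = {"the", "were"}
    print(match("the forms were formed".split(), "forming the form".split(), stops))
```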



Homework<br />

Download Porter Stemmer from the web<br />

– See URL on course web page<br />

– Compile and run it under your favourite OS<br />

– Try it out on some words and text corpora<br />

Download stop.c from the web<br />

– Download some stop lists<br />

– Compile and run stop.c under your favourite OS<br />

– Try it out on some stop lists and text corpora<br />

How can you make stop.c run on stemmed text?<br />
