EE3J2 Data Mining
Lecture 3
Zipf’s Law
Stemming & Stop Lists
Martin Russell
EE3J2 Data Mining 2008 – lecture 3
Slide 1
Objectives
Understand Zipf’s law
Understand utility of Zipf’s law for IR
Understand motivation and methods of Stemming
Understand definition and use of Stop Lists
Zipf’s Law
George Kingsley Zipf (1902–1950)
– For each word w, let F(w) be the number of times w occurs in the corpus
– Sort the words according to frequency
– The word’s rank-frequency distribution will be fitted closely by the function:

  F(r) = C / r^α,  where α ≈ 1, C ≈ 0.1
Word Frequency Plot: “Alice in Wonderland”
[Figure: frequency (0–3000) against word rank, comparing the Zipf’s law curve with the actual statistics from “Alice in Wonderland”; axis: Word (Ranked by Frequency)]
Different words 2,787, Total words 26,395
Explanations of Zipf’s Law (1)
Linguistic / psycholinguistic explanation
Mathematical explanation
Principle of Least Effort
Basically, when a person applies a tool to a job, he or she tries to minimise the effort needed to achieve an acceptable goal
Example – speech communication
– Goal is to exchange information
– Speaker would prefer minimum effort (minimum articulation, vocabulary etc)
– Listener would prefer careful articulation, detailed & specific vocabulary
– Feedback from listener indicates success or failure of communication
Speech Communication
So, if the talker and listener are familiar friends from the same linguistic background, minimal articulation is used
But if the talker and listener are, say, both non-native users of the language, then careful articulation and choice of vocabulary are important
‘Text communication’
For an author of text there is typically no immediate feedback
The ‘Principle of Least Effort’ suggests that the author will use a basic ‘working’ vocabulary where possible, with uncommon, special words for particular tasks
This appears to be consistent with Zipf’s law
Mathematical explanation
Monkey Text
– Imagine a typewriter with just 5 keys: a, b, c, d and <space>
– Suppose that a monkey sits at the typewriter and presses each key with equal probability p:
  p = p(a) = p(b) = p(c) = p(d) = p(<space>) = 1/5
– As before, we’ll say that a word is a sequence of characters bordered by <space>s
‘Monkey text’ continued
Probability of a particular 1-character ‘word’ x is:
  p(x) × p(<space>) = 1/25
There are 4 1-character ‘words’
Probability of a particular 2-character ‘word’ xy is:
  p(x) × p(y) × p(<space>) = 1/125
There are 4 × 4 = 16 2-character ‘words’
etc
Graphical ‘tree’ representation
[Figure: tree of character sequences; from the root, a branch for each of A, B, …, Z and <space> at depths 0, 1, 2, 3, with path probabilities p(A), p(AA), …, p(AZ), p(AAA)]
Zipf’s Law and ‘Monkey Text’
[Figure: word probability (0–0.045) against rank for monkey text, showing the characteristic Zipf-like decay; axis: Word (ranked by probability)]
Observations
This analysis isn’t quite right (although the basic conclusion is correct)!
Zipf’s Law applies to the probability distribution of words in a text
The probabilities that we have calculated only form a distribution (i.e. sum to one) if all sequences are considered
But we are only interested in sequences that begin and end with a space
Therefore we need to normalise the subset of probabilities corresponding to the sequences of interest
This is dealt with in Belew, pages 150–152 (but beware, there are some errors!)
More formal analysis
Suppose the alphabet has M characters, plus a space character <space>
  p = p(A) = … = p(Z) = 1/(M+1)
So, the probability of a particular ‘word’ w_k of length k is (remember the spaces before and after the word):
  p(w_k) = p^(k+2)
Calculating word probability
Given an infinitely long text, the probability p_k that w_k occurs in the text must be proportional to p(w_k):
  p_k = c·p(w_k) = c / (M+1)^(k+2)
The number of words of length k is M^k, so the probability of any word of length k occurring is M^k·p_k
It must be the case that the sum of these probabilities over all k is 1, and we can use this to find c:
  ∑_{k=1}^∞ M^k·p_k = ∑_{k=1}^∞ c·M^k / (M+1)^(k+2) = 1  ⇒  c = (M+1)^2 / M
Calculating word rank
Number of words of length 1 is M
Number of words of length 2 is M^2
In general, the number of words of length k is M^k
Therefore, the number of words of length k or less is:
  N_k = ∑_{i=1}^k M^i = M(M^k − 1) / (M − 1)
Our word w_k occurs less frequently than shorter words and more frequently than longer words. In other words, if r_k is the rank of w_k:
  N_{k−1} < r_k ≤ N_k
Calculating word rank (continued)
Therefore, on average:
  r_k = (N_{k−1} + N_k + 1) / 2 = [M(M^{k−1} − 1) + M(M^k − 1)] / (2(M − 1)) + 1/2
We can now derive two expressions for k, one in terms of p_k and one in terms of r_k.
Relating probability and rank
By setting these equal we can get an expression for p_k in terms of r_k, which is what you need to compare with the Zipf curve:
  p_k = C / (r_k + B)^α
This is of the same form as Zipf. The values of B and C depend on M and α
The value of α depends on M (see Belew, pages 150–152 for details)
For M = 26: α = 1.012, C = 0.02, B = 0.54
Conclusions (Zipf’s Law)
Zipf’s law appears to reflect simple character statistics, rather than meaning
Of limited direct relevance for IR
Potentially useful for identifying keywords
‘Resolving Power’ of words
[Figure: frequency (0–3000) against word rank with the Zipf’s Law curve; words above an upper cutoff are too common, words below a lower cutoff are too rare, and the ‘resolving power’ of a word peaks between the two cutoffs; axis: Word (ranked by frequency)]
Stemming (morphology)
Remove surface markings from words to reveal their basic form:
– forms → form
– forming → form
– formed → form
– former → form
If a query and document contain different forms of the same word, we want to know this
Stemming
Of course, not all words obey such simple rules:
– running → run
– runs → run
– women → woman
– leaves → leaf
– ferries → ferry
– alumni → alumnus
– data → datum
– crises → crisis
[Belew, chapter 2]
Stemming
Linguists distinguish between different types of morphology:
– Minor changes, such as plurals, tense
– Major changes, e.g. incentive → incentivize, which change the grammatical category of a word
Common solution is to identify sub-patterns of letters within words and devise rules for dealing with these patterns
Stemming
Example rules [Belew, p 45]
– (.*)SSES → /1SS
  Any string ending SSES is stemmed by replacing SSES with SS
– (.[AEIOU].*)ED → /1
  Any string containing a vowel and ending in ED is stemmed by removing the ED
Stemmers
A stemmer is a piece of software which implements a stemming algorithm
The Porter stemmer is a standard stemmer which is available as a free download (see EE3J2 webpage for the URL)
The Porter stemmer implements a set of about 60 rules
Use of a stemmer typically reduces vocabulary size by 10% to 50%
Example
Apply the Porter stemmer to the ‘Jane Eyre’ and ‘Alice in Wonderland’ texts
[Figure: bar chart of number of different words (0–16,000) before and after stemming for the two texts; stemming reduces the vocabulary by 34% for one text and 22% for the other]
Example
Examples of results of Porter stemmer:
– form → form
– former → former
– formed → form
– forming → form
– formal → formal
– formality → formal
– formalism → formal
– formica → formica
– formic → formic
– formant → formant
– format → format
– formation → format
Example: First paragraph from ‘Alice in Wonderland’
Before:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’
After:
alic wa begin to get veri tire of sit by her sister on the bank, and of have noth to do: onc or twice she had peep into the book her sister wa read, but it had no pictur or convers in it, ‘and what is the us of a book,’ thought alic ‘without pictur or convers?’
Noise Words
There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further outdoor exercise was now out of the question
Noise words
– Vital to understanding the grammatical structure of a text
– Of little use in the ‘bundle of words’ approach
Stop Lists
In Information Retrieval, these words are often referred to as Stop Words
Rather than detecting stop words using rules, or some other form of analysis, the stop words are simply specified to the system in a list: the Stop List
Stop Lists typically consist of the most common words from some large corpus
There are lots of candidate stop lists online
Example 1: Short Stop List (50 words)
who, will, more, if, out, so, her, all, she, there, would, their, we, him, been, has, when, not, are, but, from, or, have, an, they, which, you, were, it, with, as, his, on, be, at, by, i, this, had, the, of, and, to, a, in, that, is, was, he, for
Example 2: 300 Word Stop List
whose, special, heard, major, problems, ago, became, federal, moment, study, available, known, result, street, economic, boy, held, keep, sure, probably, free, real, seems, behind, cannot, miss, political, air, question, making, office, brought, ………., more, no, if, out, so, said, what, up, its, about, into, than, them, can, only, other, one, you, were, her, all, she, there, would, their, we, him, been, has, when, who, will, on, be, at, by, i, this, had, not, are, but, from, or, have, an, they, which, the, of, and, to, a, in, that, is, was, he, for, it, with, as, his
300 most common words from Brown Corpus
Alice vs Brown: Most Frequent Words
Alice: the, and, to, a, she, it, of, said, i, alice, in, you, was, that, as, her, at, on, all, with, had, but, for, so, be, not, very, what, this, they, little, he, out, is, one, down, up, his, if, about, then, no, know, them, like, were, again, herself, went, would, do, have, when, could, or, there, thought, off, how, me
Brown: the, of, and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, i, this, had, not, are, but, from, or, have, an, they, which, you, were, her, all, she, there, would, their, we, him, been, has, when, who, will, more, if, out, so
stop.c
C program on course website
– Reads in a stop list file (text file, one word per line)
– Stores stop words in char **stopList
– Reads the text file one word at a time
– Compares each word with each stop word
– Prints out words not in the stop list
Usage: stop stopListFile textFile > opFile
Examples
Original first paragraph:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
Stop list 50 removed:
alice beginning get very tired sitting sister bank having nothing do once twice peeped into book sister reading no pictures conversations what use book thought alice without pictures conversation
Stop list Brown removed:
alice beginning tired sitting sister bank twice peeped book sister reading pictures conversations book alice pictures conversation
Summary
[Diagram: the Document and the Text query each pass through Stemming, then Stop Word Removal, and the results are then Matched]
Homework
Download Porter Stemmer from the web
– See URL on course web page
– Compile and run it under your favourite OS
– Try it out on some words and text corpora
Download stop.c from the web
– Download some stop lists
– Compile and run stop.c under your favourite OS
– Try it out on some stop lists and text corpora
How can you make stop.c run on stemmed text?