VECTOR SPACE CLASSIFICATION
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press.
Chapter 14
Wei Wei
wwei@idi.ntnu.no
Lecture series
TDT4215 Vector Space Classification
RECALL: NAÏVE BAYES CLASSIFIERS
• Classify based on prior weight of class and conditional parameter for what each word says:

  c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]

• Training is done by counting and dividing:

  \hat{P}(c_j) = \frac{N_{c_j}}{N}

• Don’t forget to smooth:

  \hat{P}(x_k \mid c_j) = \frac{T_{c_j x_k} + 1}{\sum_{x_i \in V} (T_{c_j x_i} + 1)}
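The counting-and-smoothing training above can be sketched as follows (the toy documents and helper names are illustrative, not from the lecture; multinomial model with add-one smoothing as in the formulas):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (class, [tokens]). Returns priors and smoothed likelihoods."""
    class_counts = Counter(c for c, _ in docs)
    vocab = {t for _, toks in docs for t in toks}
    term_counts = {c: Counter() for c in class_counts}
    for c, toks in docs:
        term_counts[c].update(toks)
    # P(c_j) = N_{c_j} / N
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    # P(x_k | c_j) = (T_{c_j x_k} + 1) / sum_{x_i in V} (T_{c_j x_i} + 1)
    likelihood = {
        c: {t: (term_counts[c][t] + 1) / (sum(term_counts[c].values()) + len(vocab))
            for t in vocab}
        for c in class_counts
    }
    return priors, likelihood, vocab

def classify_nb(priors, likelihood, vocab, tokens):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    scores = {}
    for c in priors:
        s = math.log(priors[c])
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                s += math.log(likelihood[c][t])
        scores[c] = s
    return max(scores, key=scores.get)

docs = [("uk", ["london", "olympics"]), ("china", ["beijing", "olympics"])]
priors, lik, V = train_nb(docs)
print(classify_nb(priors, lik, V, ["london"]))  # "uk"
```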
VECTOR SPACE TEXT CLASSIFICATION
• Today:
  – Vector space methods for text classification
    • Rocchio classification
    • k Nearest Neighbors
  – Linear classifiers and non-linear classifiers
  – Classification with more than two classes
VECTOR SPACE TEXT CLASSIFICATION
Vector space methods for text classification
VECTOR SPACE CLASSIFICATION
• Vector Space Representation
  – Each document is a vector, one component for each term (= word).
  – Normally normalize vectors to unit length.
  – High-dimensional vector space:
    • Terms are axes
    • 10,000+ dimensions, or even 100,000+
    • Docs are vectors in this space
  – How can we do classification in this space?
VECTOR SPACE CLASSIFICATION
• As before, the training set is a set of documents, each labeled with its class (e.g., topic)
• In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
• Hypothesis 1: Documents in the same class form a contiguous region of space
• Hypothesis 2: Documents from different classes don’t overlap
• We define surfaces to delineate classes in the space
DOCUMENTS IN A VECTOR SPACE
[Figure: labeled documents plotted in the vector space, grouped into Government, Science, and Arts]

TEST DOCUMENT: WHICH CLASS?
[Figure: an unlabeled test document plotted among the Government, Science, and Arts regions]

TEST DOCUMENT = GOVERNMENT
[Figure: the test document falls in the Government region]
Our main topic today is how to find good separators
VECTOR SPACE TEXT CLASSIFICATION
Rocchio text classification
ROCCHIO TEXT CLASSIFICATION
• Rocchio Text Classification
  – Use standard tf-idf weighted vectors to represent text documents
  – For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category
    • Prototype = centroid of the members of the class
  – Assign test documents to the category with the closest prototype vector, based on cosine similarity
DEFINITION OF CENTROID

  \vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

• where D_c is the set of all documents that belong to class c and \vec{v}(d) is the vector space representation of d.
• Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
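A minimal sketch of Rocchio classification using this centroid definition (the toy vectors and class labels are made up for illustration):

```python
import numpy as np

def centroids(X, y):
    """mu(c) = (1/|D_c|) * sum of the document vectors in class c."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def rocchio_classify(mus, x):
    """Assign x to the class whose centroid has highest cosine similarity."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(mus, key=lambda c: cos(mus[c], x))

# illustrative 2D "document vectors" with two classes
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array(["gov", "gov", "arts", "arts"])
mus = centroids(X, y)
print(rocchio_classify(mus, np.array([1.0, 0.0])))  # "gov"
```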
ROCCHIO TEXT CLASSIFICATION
[Figure: red points r1–r3 and blue points b1, b2 with their class centroids; the test point t lies closer to the blue centroid and is assigned t = b]
ROCCHIO PROPERTIES
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee classifications are consistent with the given training data.
ROCCHIO ANOMALY
• Prototype models have problems with polymorphic (disjunctive) categories.
[Figure: the red class forms two separate clusters (r1, r2 and r3, r4); its centroid falls between them, so the test point t, which truly belongs to the red class (t = r), lies closer to the blue centroid]
VECTOR SPACE TEXT CLASSIFICATION
k Nearest Neighbor Classification
K NEAREST NEIGHBOR CLASSIFICATION
• kNN = k Nearest Neighbor
• To classify document d into class c:
  – Define the k-neighborhood N as the k nearest neighbors of d
  – Count the number of documents i in N that belong to c
  – Estimate P(c|d) as i/k
  – Choose as class argmax_c P(c|d) [= majority class]
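The steps above can be sketched as follows (Euclidean distance is used here for simplicity; the toy data is hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X - x, axis=1)     # distance from x to every training doc
    nearest = np.argsort(dists)[:k]           # indices of the k-neighborhood N
    votes = Counter(y[i] for i in nearest)    # count class members i in N
    return votes.most_common(1)[0][0]         # argmax_c P(c|d), P(c|d) ~ i/k

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = ["a", "a", "b", "b"]
print(knn_classify(X, y, np.array([0.2, 0.0]), k=3))  # "a"
```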
K NEAREST NEIGHBOR CLASSIFICATION
• Unlike Rocchio, kNN classification determines the decision boundary locally.
• For 1NN (k = 1), we assign each document to the class of its closest neighbor.
• For kNN, we assign each document to the majority class of its k closest neighbors. k here is a parameter.
• The rationale of kNN: the contiguity hypothesis.
  – We expect a test document d to have the same label as the training documents located nearby.
KNN: k = 1
[Figure: decision boundary of a 1NN classifier]
KNN: k = 1, 5, 10
[Figure: decision boundaries for k = 1, 5, 10]
KNN: WEIGHTED-SUM VOTING
[Figure: example of weighted-sum voting among the k nearest neighbors]
K NEAREST NEIGHBOR CLASSIFICATION
[Figure: a test document and its nearest neighbors among the Government, Science, and Arts classes]
K NEAREST NEIGHBOR CLASSIFICATION
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  – Compute the similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based learning
  – Memory-based learning
  – Lazy learning
• Rationale of kNN: the contiguity hypothesis
SIMILARITY METRICS
• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of tf-idf weighted vectors is typically most effective.
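A minimal cosine-similarity function over sparse term-weight vectors (the weights below are illustrative; in practice they would be tf-idf values):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# hypothetical term weights for two short documents
d1 = {"london": 2.0, "olympics": 1.0}
d2 = {"london": 1.0, "beijing": 1.0}
print(round(cosine(d1, d2), 3))  # 2 / sqrt(10) ~ 0.632
```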
AN EXAMPLE: COSINE SIMILARITY
[Figure: cosine similarity between a test point t and points r1–r3, b1, b2; t is assigned t = b]
KNN DISCUSSION
• Functional definition of “similarity”
  – e.g. cosine, Euclidean, kernel functions, ...
• How many neighbors do we consider?
  – Value of k determined empirically (normally 3 or 5)
• Does each neighbor get the same weight?
  – Weighted-sum or not
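Weighted-sum voting can be sketched as follows (cosine similarity as the weight; the data is illustrative):

```python
import numpy as np
from collections import defaultdict

def weighted_knn(X, y, x, k=3):
    """Each of the k nearest neighbors votes with weight = cosine similarity to x."""
    sims = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x))
    scores = defaultdict(float)
    for i in np.argsort(sims)[-k:]:   # indices of the k most similar training docs
        scores[y[i]] += sims[i]       # vote weighted by similarity, not just count
    return max(scores, key=scores.get)

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
y = ["r", "r", "b"]
print(weighted_knn(X, y, np.array([1.0, 0.1]), k=3))  # "r"
```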
KNN DISCUSSION (CONT.)
• No feature selection necessary
• Scales well with a large number of classes
  – Don’t need to train n classifiers for n classes
• Classes can influence each other
  – Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary
  – Actually: perhaps not true (data editing, etc.)
• May be more expensive at test time
TEXT CLASSIFICATION
Linear classifiers and non-linear classifiers
LINEAR CLASSIFIER
• Many common text classifiers are linear classifiers:
  – Naïve Bayes
  – Perceptron
  – Rocchio
  – Logistic regression
  – Support vector machines (with linear kernel)
  – Linear regression
LINEAR CLASSIFIER: 2D
• In two dimensions, a linear classifier is a line. These lines have the functional form

  w_1 x_1 + w_2 x_2 = b

• The classification rules:
  – if w_1 x_1 + w_2 x_2 > b, then c
  – if w_1 x_1 + w_2 x_2 \le b, then not-c
• Here:
  – (x_1, x_2)^T : the 2D vector representation of the document
  – (w_1, w_2)^T : the parameter vector that defines the boundary
LINEAR CLASSIFIER
• We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane:

  \vec{w}^T \vec{x} = b

• The classification rule is then:
  – if \vec{w}^T \vec{x} > b, then c
  – if \vec{w}^T \vec{x} \le b, then not-c
• Why are Rocchio and Naive Bayes classifiers linear classifiers?
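The hyperplane decision rule can be sketched as follows (the weights and bias are arbitrary illustrative values, not learned parameters):

```python
import numpy as np

def linear_classify(w, b, x):
    """Assign class c if w.x > b, else not-c (hyperplane w.x = b)."""
    return "c" if w @ x > b else "not-c"

w = np.array([2.0, 1.0])   # illustrative parameter vector
b = 1.5                    # illustrative bias
print(linear_classify(w, b, np.array([1.0, 0.5])))  # w.x = 2.5 > 1.5 -> "c"
print(linear_classify(w, b, np.array([0.2, 0.2])))  # w.x = 0.6 <= 1.5 -> "not-c"
```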
NON-LINEAR CLASSIFIER
• Non-linear classifier: kNN
• A linear classifier, e.g. Naive Bayes, does badly on this task:
[Figure: a class distribution that no line can separate; kNN will do very well (assuming enough training data)]
TEXT CLASSIFICATION
Classification with more than two classes
CLASSIFICATION WITH MORE THAN TWO CLASSES
• Classification for classes that are not mutually exclusive is called the any-of classification problem.
• Classification for classes that are mutually exclusive is called the one-of classification problem.
• We have learned two-class linear classifiers:
  – a linear classifier that can classify d to c or not-c.
• How can we extend the two-class linear classifiers to J > 2 classes?
  – i.e., to classify a document d to one of, or any of, classes c1, c2, c3, …
CLASSIFICATION WITH MORE THAN TWO CLASSES
• For one-of classification tasks:
  1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
  2. Given the test document, apply each classifier separately.
  3. Assign the document to the class with
     • the maximum score,
     • the maximum confidence value,
     • or the maximum probability.
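The one-of procedure can be sketched as follows (the scoring functions are hypothetical stand-ins for trained binary classifiers):

```python
def one_of_classify(classifiers, d):
    """One-of (single-label): run every binary classifier, pick the max score."""
    scores = {c: clf(d) for c, clf in classifiers.items()}
    return max(scores, key=scores.get)

# hypothetical per-class scorers returning a confidence value
classifiers = {"uk": lambda d: d.count("london"),
               "sports": lambda d: d.count("olympics")}
print(one_of_classify(classifiers, ["london", "london", "olympics"]))  # "uk"
```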
CLASSIFICATION WITH MORE THAN TWO CLASSES
• For any-of classification tasks:
  1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
  2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers.
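The any-of procedure can be sketched as follows (hypothetical scorers and per-class thresholds stand in for trained classifiers; each binary decision is made independently):

```python
def any_of_classify(classifiers, thresholds, d):
    """Any-of (multi-label): keep every class whose classifier fires independently."""
    return [c for c, clf in classifiers.items() if clf(d) > thresholds[c]]

# hypothetical per-class scorers and decision thresholds
classifiers = {"uk": lambda d: d.count("london"),
               "sports": lambda d: d.count("olympics")}
thresholds = {"uk": 0, "sports": 0}
print(any_of_classify(classifiers, thresholds, ["london", "olympics"]))  # ["uk", "sports"]
```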
THE TEXT CLASSIFICATION PROBLEM
An example:
• Document d contains only one sentence:
  “London is planning to organize the 2012 Olympics.”
• We have six classes.
• Determined: UK and sports, since
  p(UK) · p(d|UK) > t1
  p(sports) · p(d|sports) > t2
SUMMARY
• Vector space methods for text classification
  – Rocchio classification
  – k Nearest Neighbors
• Linear classifiers and non-linear classifiers
• Classification with more than two classes