
VECTOR SPACE CLASSIFICATION

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press.
Chapter 14

Wei Wei
wwei@idi.ntnu.no

Lecture series
TDT4215 Vector Space Classification 1


Recall: Naïve Bayes classifiers

• Classify based on the prior weight of the class and the conditional parameters for what each word says:

  c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]

• Training is done by counting and dividing:

  P(c_j) = N_{c_j} / N

• Don't forget to smooth:

  P(x_k | c_j) = (T_{c_j, x_k} + 1) / Σ_{x_i ∈ V} (T_{c_j, x_i} + 1)
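The counting-and-smoothing recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation; the toy corpus below is hypothetical data chosen only to exercise the formulas.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes by counting and dividing.
    docs: list of (token_list, label) pairs."""
    n_docs = len(docs)
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)  # T[c][x]: count of token x in class c
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    # P(c_j) = N_{c_j} / N
    priors = {c: class_counts[c] / n_docs for c in class_counts}
    # P(x_k | c_j) with add-one smoothing over the vocabulary
    cond = {}
    for c in class_counts:
        denom = sum(token_counts[c].values()) + len(vocab)
        cond[c] = {x: (token_counts[c][x] + 1) / denom for x in vocab}
        cond[c][None] = 1 / denom  # fallback mass for unseen tokens
    return priors, cond

def classify_nb(tokens, priors, cond):
    """c_NB = argmax_{c_j} [ log P(c_j) + sum_i log P(x_i | c_j) ]."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c].get(x, cond[c][None])) for x in tokens)
    return max(priors, key=log_score)

# Hypothetical toy corpus for illustration
train = [(["chinese", "beijing"], "china"),
         (["chinese", "shanghai"], "china"),
         (["tokyo", "japan"], "japan")]
priors, cond = train_nb(train)
print(classify_nb(["chinese", "chinese", "tokyo"], priors, cond))  # china
```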



VECTOR SPACE TEXT CLASSIFICATION

• Today:
  – Vector space methods for text classification
    • Rocchio classification
    • k Nearest Neighbors
  – Linear classifiers and non-linear classifiers
  – Classification with more than two classes



VECTOR SPACE TEXT CLASSIFICATION

Vector space methods for text classification



VECTOR SPACE CLASSIFICATION

• Vector space representation
  – Each document is a vector, with one component for each term (= word).
  – Normally we normalize vectors to unit length.
  – High-dimensional vector space:
    • Terms are axes
    • 10,000+ dimensions, or even 100,000+
    • Docs are vectors in this space
  – How can we do classification in this space?



VECTOR SPACE CLASSIFICATION

• As before, the training set is a set of documents, each labeled with its class (e.g., topic).
• In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space.
• Hypothesis 1: documents in the same class form a contiguous region of space.
• Hypothesis 2: documents from different classes don't overlap.
• We define surfaces to delineate classes in the space.



Documents in a vector space

[Figure: documents plotted in a vector space, with regions labeled Government, Science, and Arts]



Test document: which class?

[Figure: a test document placed among the Government, Science, and Arts regions]



Test document = Government

[Figure: the test document falls in the Government region]

Our main topic today is how to find good separators.



VECTOR SPACE TEXT CLASSIFICATION

Rocchio text classification



Rocchio text classification

• Rocchio text classification
  – Use standard tf-idf weighted vectors to represent text documents.
  – For each category, compute a prototype vector by summing the vectors of the training documents in that category.
    • Prototype = centroid of the members of the class
  – Assign test documents to the category with the closest prototype vector, based on cosine similarity.



DEFINITION OF CENTROID

  μ(c) = (1 / |D_c|) · Σ_{d ∈ D_c} v(d)

• Where D_c is the set of all documents that belong to class c, and v(d) is the vector space representation of d.
• Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
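The centroid definition is direct to compute; a minimal sketch, using a pair of hypothetical unit vectors that also demonstrates the note above (their centroid is not a unit vector):

```python
import numpy as np

def centroid(doc_vectors):
    """mu(c) = (1/|D_c|) * sum_{d in D_c} v(d): the mean of the class's vectors."""
    return np.mean(np.asarray(doc_vectors, dtype=float), axis=0)

# Two unit vectors whose centroid has norm < 1
mu = centroid([[1.0, 0.0], [0.0, 1.0]])
print(mu, np.linalg.norm(mu))  # [0.5 0.5] and norm ~0.707
```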



ROCCHIO TEXT CLASSIFICATION

[Figure: red training points r1, r2, r3 and blue training points b1, b2 with their class centroids; the test point t is assigned to the blue class (t = b)]



ROCCHIO PROPERTIES

• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to the class prototypes.
• Does not guarantee that classifications are consistent with the given training data.



ROCCHIO ANOMALY

• Prototype models have problems with polymorphic (disjunctive) categories.

[Figure: the red class is split into two clusters (r1, r2 and r3, r4) with the blue points b1, b2 between them; the test point t has true class r]



VECTOR SPACE TEXT CLASSIFICATION

k Nearest Neighbor classification



K NEAREST NEIGHBOR CLASSIFICATION

• kNN = k Nearest Neighbor
• To classify a document d into class c:
  • Define the k-neighborhood N as the k nearest neighbors of d
  • Count the number of documents i in N that belong to c
  • Estimate P(c|d) as i/k
  • Choose as class argmax_c P(c|d) [= majority class]
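The four steps above translate almost line for line into code. A minimal sketch using Euclidean distance (the metric is a free choice here; the toy points are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_vecs, train_labels, k=3):
    """Return (majority class, estimated P(c|x) = i/k) for test point x."""
    # Define the k-neighborhood N as the k nearest neighbors of x
    dists = np.linalg.norm(np.asarray(train_vecs, float) - np.asarray(x, float), axis=1)
    neighborhood = np.argsort(dists)[:k]
    # Count documents per class in N and pick the majority class
    votes = Counter(train_labels[i] for i in neighborhood)
    label, count = votes.most_common(1)[0]
    return label, count / k

vecs = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify([0.5, 0.5], vecs, labels, k=3))  # ('a', 1.0)
```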



K NEAREST NEIGHBOR CLASSIFICATION

• Unlike Rocchio, kNN classification determines the decision boundary locally.
• For 1NN (k = 1), we assign each document to the class of its closest neighbor.
• For kNN, we assign each document to the majority class of its k closest neighbors; k here is a parameter.
• The rationale of kNN: the contiguity hypothesis.
  – We expect a test document d to have the same label as the training documents located nearby.



kNN: k = 1

[Figure: the 1NN decision boundary over the training points]



kNN: k = 1, 5, 10

[Figure: decision boundaries for k = 1, 5, and 10]



kNN: weighted-sum voting



K NEAREST NEIGHBOR CLASSIFICATION

[Figure: a test document among the Government, Science, and Arts regions]



K NEAREST NEIGHBOR CLASSIFICATION

• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  – Compute the similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based learning
  – Memory-based learning
  – Lazy learning
• Rationale of kNN: the contiguity hypothesis



SIMILARITY METRICS

• The nearest neighbor method depends on a similarity (or distance) metric.
• The simplest metric for a continuous m-dimensional instance space is Euclidean distance.
• The simplest metric for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of tf-idf weighted vectors is typically most effective.
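The three metrics named above are each a one-liner; a minimal sketch with small made-up vectors:

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance for continuous m-dimensional instances."""
    return float(np.linalg.norm(np.asarray(u, float) - np.asarray(v, float)))

def hamming(u, v):
    """Hamming distance for binary instances: number of positions that differ."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

def cosine(u, v):
    """Cosine similarity, the usual choice for tf-idf text vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(euclidean([0, 0], [3, 4]))     # 5.0
print(hamming([1, 0, 1], [1, 1, 0])) # 2
print(cosine([1, 0], [1, 1]))        # ~0.707
```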



An example: cosine similarity

[Figure: red points r1, r2, r3 and blue points b1, b2; by cosine similarity the test point t is assigned to the blue class (t = b)]



kNN discussion

• Functional definition of "similarity"
  • e.g., cosine, Euclidean, kernel functions, ...
• How many neighbors do we consider?
  • The value of k is determined empirically (normally 3 or 5)
• Does each neighbor get the same weight?
  • Weighted-sum voting or not
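The weighted-sum variant mentioned above can be sketched as follows: instead of one vote each, every one of the k nearest neighbors votes with a weight equal to its cosine similarity to the test point. The toy vectors are hypothetical.

```python
import numpy as np
from collections import defaultdict

def knn_weighted(x, train_vecs, train_labels, k=3):
    """kNN with weighted-sum voting: each of the k most similar
    neighbors contributes its cosine similarity as vote weight."""
    x = np.asarray(x, float)
    vecs = np.asarray(train_vecs, float)
    sims = vecs @ x / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(sims)[::-1][:k]  # k most similar, descending
    votes = defaultdict(float)
    for i in nearest:
        votes[train_labels[i]] += sims[i]
    return max(votes, key=votes.get)

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = ["a", "a", "b"]
print(knn_weighted([1.0, 0.05], vecs, labels, k=3))  # 'a'
```

With equal-weight voting and k = 3 this example would be a 2-to-1 majority anyway; the weighting matters when the nearer neighbors disagree with the raw count.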



kNN discussion (cont.)

• No feature selection necessary
• Scales well with a large number of classes
  – Don't need to train n classifiers for n classes
• Classes can influence each other
  – Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary
  – Actually: perhaps not true. (Data editing, etc.)
• May be more expensive at test time



Text Classification

Linear classifiers and non-linear classifiers



Linear Classifier

• Many common text classifiers are linear classifiers:
  – Naïve Bayes
  – Perceptron
  – Rocchio
  – Logistic regression
  – Support vector machines (with linear kernel)
  – Linear regression



Linear Classifier: 2D

• In two dimensions, a linear classifier is a line. These lines have the functional form

  w_1 x_1 + w_2 x_2 = b

• The classification rules:
  – if w_1 x_1 + w_2 x_2 > b, => c
  – if w_1 x_1 + w_2 x_2 ≤ b, => not-c

• Here:
  – (x_1, x_2)^T is the 2D vector representation of the document
  – (w_1, w_2)^T is the parameter vector that, together with b, defines the boundary



Linear Classifier

• We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane:

  w^T x = b

• The classification rule is then:
  – if w^T x > b, => c
  – if w^T x ≤ b, => not-c

• This is why Rocchio and Naive Bayes classifiers are linear classifiers.
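The hyperplane rule above is a single dot product. A minimal sketch, with weights and threshold picked arbitrarily for illustration (in practice they come from training, e.g. Rocchio or Naive Bayes):

```python
import numpy as np

def linear_classify(x, w, b):
    """Assign class c if w^T x > b, otherwise not-c."""
    return "c" if float(np.dot(w, x)) > b else "not-c"

w, b = [1.0, 1.0], 1.0                    # hypothetical trained parameters
print(linear_classify([1.0, 1.0], w, b))  # c      (w^T x = 2.0 > 1.0)
print(linear_classify([0.2, 0.3], w, b))  # not-c  (w^T x = 0.5 <= 1.0)
```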



Non-linear classifier

• Non-linear classifier: kNN
• A linear classifier, e.g. Naive Bayes, does badly on this task:

[Figure: a class whose true decision boundary is non-linear; kNN will do very well here (assuming enough training data)]



Text classification

Classification with more than two classes



Classification with more than two classes

• Classification for classes that are not mutually exclusive is called the any-of classification problem.
• Classification for classes that are mutually exclusive is called the one-of classification problem.
• We have learned two-class linear classifiers:
  – a linear classifier that can classify d as c or not-c.
• How can we extend two-class linear classifiers to J > 2 classes?
  – i.e., classify a document d into one of, or any of, the classes c1, c2, c3, ...



Classification with more than two classes

• For one-of classification tasks:
  1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
  2. Given the test document, apply each classifier separately.
  3. Assign the document to the class with
     • the maximum score,
     • the maximum confidence value,
     • or the maximum probability.
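The three one-of steps above reduce to an argmax over per-class scores. A minimal sketch; the scorer functions here are hypothetical stand-ins for trained one-vs-rest classifiers:

```python
def one_of_classify(doc, scorers):
    """One-of: apply each per-class scorer separately and assign
    the document to the class with the maximum score."""
    return max(scorers, key=lambda c: scorers[c](doc))

# Hypothetical per-class scorers (keyword counters, for illustration only)
scorers = {"UK": lambda d: d.count("london"),
           "sports": lambda d: d.count("olympics")}
print(one_of_classify(["london", "london", "olympics"], scorers))  # UK
```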



Classification with more than two classes

• For any-of classification tasks:
  1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
  2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers.
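In the any-of case, each classifier makes an independent yes/no decision, so a document can receive several labels (or none). A minimal sketch with hypothetical scorers and per-class thresholds:

```python
def any_of_classify(doc, scorers, thresholds):
    """Any-of: each classifier decides independently; return every
    class whose score exceeds its own threshold."""
    return sorted(c for c in scorers if scorers[c](doc) > thresholds[c])

# Hypothetical scorers and thresholds, for illustration only
scorers = {"UK": lambda d: d.count("london"),
           "sports": lambda d: d.count("olympics"),
           "music": lambda d: d.count("concert")}
thresholds = {"UK": 0, "sports": 0, "music": 0}
print(any_of_classify(["london", "olympics"], scorers, thresholds))  # ['UK', 'sports']
```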



THE TEXT CLASSIFICATION PROBLEM

An example:
• Document d with only one sentence:
  "London is planning to organize the 2012 Olympics."
• We have six classes, among them <UK> and <sports>.
• Determined: <UK> and <sports>, since
  p(UK) p(d|UK) > t1
  p(sports) p(d|sports) > t2



Summary

• Vector space methods for text classification
  – Rocchio classification
  – k Nearest Neighbors
• Linear classifiers and non-linear classifiers
• Classification with more than two classes

