Introduction to Information Retrieval
SCCS414: Information Storage and Retrieval
Christopher Manning and Prabhakar Raghavan
Lecture 12: Text Classification; Vector space classification (kNN)
[Borrows slides from Ray Mooney]


k Nearest Neighbor Classification (Sec. 14.3)
• kNN = k Nearest Neighbor
• To classify a document d into class c:
  • Define the k-neighborhood N as the k nearest neighbors of d
  • Count the number of documents i in N that belong to c
  • Estimate P(c|d) as i/k
  • Choose as class argmax_c P(c|d) [= majority class]
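Below is a minimal Python sketch of this decision rule, not the lecture's reference implementation: documents are assumed to be sparse term -> weight dictionaries, and the cosine_sim helper, knn_classify name, and training-set layout are illustrative assumptions.

import math
from collections import Counter

def cosine_sim(u, v):
    # Cosine similarity between two sparse term -> weight vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(d, training_set, k):
    # training_set: list of (vector, class_label) pairs.
    neighbors = sorted(training_set, key=lambda ex: cosine_sim(d, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)   # number of neighbors per class
    # P(c|d) is estimated as votes[c] / k; the argmax over c is the majority class.
    return votes.most_common(1)[0][0]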


Example: k = 6 (6NN) (Sec. 14.3)
[Figure: a test document among training documents from the classes Government, Science, and Arts – P(science | test document)?]


Nearest-Neighbor Learning Algorithm (Sec. 14.3)
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  • Compute similarity between x and all examples in D.
  • Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  • Case-based learning
  • Memory-based learning
  • Lazy learning


k Nearest Neighbor (kNN) Classification (Sec. 14.3)
• For k = 1 (1NN), we assign each test document to the class of its nearest neighbor in the training set.
  • 1NN is not very robust – one document can be mislabeled or atypical.
• For k > 1, we assign each test document to the majority class of its k nearest neighbors in the training set.
• This amounts to locally defined decision boundaries between classes – far away points do not influence the classification decision. (Different from Rocchio.)
• Rationale of kNN: the contiguity hypothesis
  • We expect a test document d to have the same label as the training documents located in the local region surrounding d.


kNN decision boundaries (Sec. 14.3)
[Figure: decision boundaries between the classes Government, Science, and Arts – the boundaries are in principle arbitrary surfaces, but usually polyhedra.]
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).


kNN is based on Voronoi tessellation
[Figure: 1NN, 2NN, 3NN classification decision for the star?]


Similarity Metrics (Sec. 14.3)
• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
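A small sketch of one way to build such tf.idf weighted vectors; the log-scaled tf and plain log idf used here are an assumption, since the slide does not fix a particular weighting variant.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one term -> weight dict per document.
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf})
    return vectors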


Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3)
[Figure]


3 Nearest Neighbor vs. Rocchio
• Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.


kNN deals with multimodal classes better
• The point O has more nearest neighbors in B than in A.


kNN example
[Figure: 1NN and 3NN classification for d5?]


Nearest Neighbor with Inverted Index (Sec. 14.3)
• Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
• But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
• Use standard vector space inverted index methods to find the k nearest neighbors.
• Testing time: O(B|V_t|), where B is the average number of training documents in which a test-document word appears.
  • Typically B << |D|
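A sketch of this idea, assuming a simple in-memory postings layout and unnormalized dot-product scoring (both assumptions, not the lecture's implementation); only training documents that share at least one term with the test document are ever scored.

from collections import defaultdict
import heapq

def build_index(training_vectors):
    # training_vectors: list of term -> weight dicts. Postings: term -> [(doc_id, weight)].
    index = defaultdict(list)
    for doc_id, vec in enumerate(training_vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def k_nearest(test_vec, index, k):
    # Term-at-a-time accumulation: for each of the |V_t| test-document terms,
    # walk its postings list (B entries on average) and accumulate scores.
    scores = defaultdict(float)
    for term, wq in test_vec.items():
        for doc_id, wd in index.get(term, []):
            scores[doc_id] += wq * wd
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])   # (doc_id, score) pairs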


kNN: Discussion (Sec. 14.3)
• No feature selection necessary
• Scales well with a large number of classes
  • Don't need to train n classifiers for n classes
• Classes can influence each other
  • Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary
  • Actually: perhaps not true. (Data editing, etc.)
• May be expensive at test time
• In most cases it's more accurate than NB or Rocchio


Linear classifiers
• Definition: a linear classifier is a classifier that decides class membership by comparing a linear combination of the features to a threshold.
• Linear classifiers compute a linear combination or weighted sum Σ_i w_i x_i of the feature values.
• Classification decision: Σ_i w_i x_i > b ?
• (First, we only consider binary (2-class) classifiers.)
• Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
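As a minimal illustration (the function name and the dictionary-based feature representation are assumptions), the decision rule is just a weighted sum compared to a threshold:

def linear_classify(x, w, b):
    # x: feature -> value dict for the document; w: feature -> weight dict; b: threshold.
    score = sum(w.get(feature, 0.0) * value for feature, value in x.items())   # sum_i w_i x_i
    return score > b   # True -> positive class, False -> negative class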


Linear classifiers
• Assumption: the classes are linearly separable.
• Can find a hyperplane (= separator) based on the training set.
• Methods for finding a separator: Perceptron, Rocchio, Naïve Bayes – as we will explain on the next slides.


Example of a linear two-class classifier
• This is for the class interest in Reuters-21578.
• For simplicity: assume a simple 0/1 vector representation.
• We assign the document "rate discount dlrs world" to interest since w · d = 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0 = b.
• We assign "prime dlrs" to the complement class (not in interest) since w · d = −0.01 ≤ b.
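The arithmetic can be checked with a few lines of Python. The weights for rate, discount, dlrs and world are the ones used in the slide's computation; the weight 0.70 for prime is inferred from the −0.01 score and is therefore an assumption.

# 0/1 vector representation: each distinct document term contributes its weight once.
weights = {"rate": 0.67, "discount": 0.46, "dlrs": -0.71, "world": -0.35,
           "prime": 0.70}   # 'prime' weight inferred, not given explicitly on the slide
b = 0.0

def score(doc, w):
    return sum(w.get(term, 0.0) for term in set(doc.split()))

print(score("rate discount dlrs world", weights))  # 0.07 > b  -> assign to 'interest'
print(score("prime dlrs", weights))                # -0.01 <= b -> complement class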


Rocchio separators are linear classifiers: the decision can be expressed as comparing Σ_i w_i x_i to a threshold.
[Figure]


Two-class Rocchio as a linear classifier
• Line or hyperplane defined by: Σ_{i=1..M} w_i d_i = b
• For Rocchio, set:
  w = μ(c1) − μ(c2)
  b = 0.5 · (|μ(c1)|² − |μ(c2)|²)
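A compact sketch of this construction, using dense feature lists for brevity (the helper names are illustrative assumptions, not from the lecture):

def centroid(docs):
    # docs: list of equal-length feature lists; returns their component-wise mean mu.
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def rocchio_separator(docs_c1, docs_c2):
    mu1, mu2 = centroid(docs_c1), centroid(docs_c2)
    w = [a - b for a, b in zip(mu1, mu2)]                 # w = mu(c1) - mu(c2)
    b = 0.5 * (sum(a * a for a in mu1) - sum(c * c for c in mu2))   # b = 0.5(|mu(c1)|^2 - |mu(c2)|^2)
    return w, b

def classify_c1(x, w, b):
    # Assign to c1 iff w . x > b, i.e. x is closer to mu(c1) than to mu(c2).
    return sum(wi * xi for wi, xi in zip(w, x)) > b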


Which Hyperplane?
In general, there are lots of possible solutions for a, b, c.


Which hyperplane?
• For linearly separable training sets: there are infinitely many separating hyperplanes.
• They all separate the training set perfectly . . .
• . . . but they behave differently on test data.
• Error rates on new data are low for some, high for others.


Linear classifiers: Discussion
• Many common text classifiers are linear classifiers: Naïve Bayes, Rocchio, logistic regression, linear support vector machines, etc.
• Each method has a different way of selecting the separating hyperplane – huge differences in performance.
• Can we get better performance with more powerful nonlinear classifiers?
• Not in general: a given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.


A linear problem with noise
• 3 noise documents
[Figure]


A nonlinear problem
• A linear classifier like Rocchio does badly on this task.
• kNN will do very well (assuming enough training data).


Which classifier do I use for a given TC problem?
• Is there a learning method that is optimal for all text classification problems?
• No, because there is a tradeoff between bias and variance.
• Factors to take into account:
  • How much training data is available?
  • How simple/complex is the problem? (linear vs. nonlinear decision boundary)
  • How noisy is the problem?
  • How stable is the problem over time?
    • For an unstable problem, it's better to use a simple and robust classifier.


Bias vs. variance: Choosing the correct model capacity (Sec. 14.6)


How to combine hyperplanes for > 2 classes?
[Figure]


Any-of vs. one-of problems (Sec. 14.5)
• Any-of or multivalue or multilabel classification
  • A document can belong to 0, 1, or >1 classes.
  • Classes are independent of each other. A decision on one class leaves all decisions open on other classes.
  • Example: topic classification
• One-of or multiclass or polytomous classification
  • Classes are mutually exclusive.
  • Each document belongs to exactly one class.
  • Example: language of a document (assumption: no document contains multiple languages)


Any-of classification
• Simply run each two-class classifier separately on the test document and assign the document accordingly.
• Done.
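A minimal sketch, assuming each two-class classifier is represented as a boolean-valued function of the document (the function name and dict layout are assumptions):

def any_of_classify(doc, binary_classifiers):
    # binary_classifiers: dict mapping class name -> function(doc) -> bool.
    # The document receives every class whose classifier accepts it (possibly none).
    return {c for c, clf in binary_classifiers.items() if clf(doc)}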


One-of classification
• Run each two-class classifier separately.
• Assign the document to the class with:
  • maximum score
  • maximum confidence
  • maximum probability
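A minimal sketch, assuming each classifier exposes a real-valued, mutually comparable score (names are illustrative):

def one_of_classify(doc, scorers):
    # scorers: dict mapping class name -> function(doc) -> real-valued score.
    # Pick the single class with the maximum score.
    return max(scorers, key=lambda c: scorers[c](doc))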


Summary: Representation of Text Categorization Attributes
• Representations of text are usually very high dimensional (one feature for each word).
• High-bias algorithms that prevent overfitting in high-dimensional space generally work best.
• For most text categorization tasks, there are many relevant features and many irrelevant ones.
• Methods that combine evidence from many or all features (e.g. Naive Bayes, kNN, neural nets) often tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction).*
  * Although the results are a bit more mixed than often thought.


References
• Chapter 14 in IIR.
• General overview of text classification: Sebastiani (2002)
• Text classification chapter on decision trees and perceptrons: Manning & Schütze (1999)
• One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003)