Introduction to Information Retrieval
SCCS414: Information Storage and Retrieval
Christopher Manning and Prabhakar Raghavan
Lecture 12: Text Classification; Vector space classification (kNN)
[Borrows slides from Ray Mooney]


k Nearest Neighbor Classification (Sec. 14.3)
• kNN = k Nearest Neighbor
• To classify a document d into class c:
  • Define the k-neighborhood N as the k nearest neighbors of d
  • Count the number of documents i in N that belong to c
  • Estimate P(c|d) as i/k
  • Choose as class argmax_c P(c|d) [= majority class]
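Below is a minimal Python sketch of this decision rule, not the lecture's reference implementation: documents are assumed to be sparse term -> weight dictionaries, and the cosine_sim helper, knn_classify name, and training-set layout are illustrative assumptions.

import math
from collections import Counter

def cosine_sim(u, v):
    # Cosine similarity between two sparse term -> weight vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(d, training_set, k):
    # training_set: list of (vector, class_label) pairs.
    neighbors = sorted(training_set, key=lambda ex: cosine_sim(d, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)   # number of neighbors per class
    # P(c|d) is estimated as votes[c] / k; the argmax over c is the majority class.
    return votes.most_common(1)[0][0]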


Example: k = 6 (6NN) (Sec. 14.3)
[Figure: a test document among training documents from the classes Government, Science, and Arts – P(science | test document)?]


Nearest-Neighbor Learning Algorithm (Sec. 14.3)
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  • Compute similarity between x and all examples in D.
  • Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  • Case-based learning
  • Memory-based learning
  • Lazy learning


k Nearest Neighbor (kNN) Classification (Sec. 14.3)
• For k = 1 (1NN), we assign each test document to the class of its nearest neighbor in the training set.
  • 1NN is not very robust – one document can be mislabeled or atypical.
• For k > 1, we assign each test document to the majority class of its k nearest neighbors in the training set.
• This amounts to locally defined decision boundaries between classes – far away points do not influence the classification decision. (Different from Rocchio.)
• Rationale of kNN: the contiguity hypothesis
  • We expect a test document d to have the same label as the training documents located in the local region surrounding d.


kNN decision boundaries (Sec. 14.3)
[Figure: decision boundaries between the classes Government, Science, and Arts – the boundaries are in principle arbitrary surfaces, but usually polyhedra.]
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).


kNN is based on Voronoi tessellation
[Figure: 1NN, 2NN, 3NN classification decision for the star?]


Similarity Metrics (Sec. 14.3)
• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
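A small sketch of one way to build such tf.idf weighted vectors; the log-scaled tf and plain log idf used here are an assumption, since the slide does not fix a particular weighting variant.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one term -> weight dict per document.
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf})
    return vectors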


Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3)
[Figure]


3 Nearest Neighbor vs. Rocchio
• Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.


kNN deals with multimodal classes better
• The point O has more nearest neighbors in B than in A.


kNN example
[Figure: 1NN and 3NN classification for d5?]


Nearest Neighbor with Inverted Index (Sec. 14.3)
• Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
• But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
• Use standard vector space inverted index methods to find the k nearest neighbors.
• Testing time: O(B|V_t|), where B is the average number of training documents in which a test-document word appears.
  • Typically B << |D|
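A sketch of this idea, assuming a simple in-memory postings layout and unnormalized dot-product scoring (both assumptions, not the lecture's implementation); only training documents that share at least one term with the test document are ever scored.

from collections import defaultdict
import heapq

def build_index(training_vectors):
    # training_vectors: list of term -> weight dicts. Postings: term -> [(doc_id, weight)].
    index = defaultdict(list)
    for doc_id, vec in enumerate(training_vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def k_nearest(test_vec, index, k):
    # Term-at-a-time accumulation: for each of the |V_t| test-document terms,
    # walk its postings list (B entries on average) and accumulate scores.
    scores = defaultdict(float)
    for term, wq in test_vec.items():
        for doc_id, wd in index.get(term, []):
            scores[doc_id] += wq * wd
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])   # (doc_id, score) pairs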


kNN: Discussion (Sec. 14.3)
• No feature selection necessary
• Scales well with a large number of classes
  • Don't need to train n classifiers for n classes
• Classes can influence each other
  • Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary
  • Actually: perhaps not true. (Data editing, etc.)
• May be expensive at test time
• In most cases it's more accurate than NB or Rocchio


Linear classifiers
• Definition: a linear classifier is a classifier that decides class membership by comparing a linear combination of the features to a threshold.
• Linear classifiers compute a linear combination or weighted sum Σ_i w_i x_i of the feature values.
• Classification decision: Σ_i w_i x_i > b ?
• (First, we only consider binary (2-class) classifiers.)
• Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
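As a minimal illustration (the function name and the dictionary-based feature representation are assumptions), the decision rule is just a weighted sum compared to a threshold:

def linear_classify(x, w, b):
    # x: feature -> value dict for the document; w: feature -> weight dict; b: threshold.
    score = sum(w.get(feature, 0.0) * value for feature, value in x.items())   # sum_i w_i x_i
    return score > b   # True -> positive class, False -> negative class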


Linear classifiers
• Assumption: the classes are linearly separable.
• Can find a hyperplane (= separator) based on the training set.
• Methods for finding a separator: Perceptron, Rocchio, Naïve Bayes – as we will explain on the next slides.


Example of a linear two-class classifier
• This is for the class interest in Reuters-21578.
• For simplicity: assume a simple 0/1 vector representation.
• We assign the document "rate discount dlrs world" to interest since w · d = 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0 = b.
• We assign "prime dlrs" to the complement class (not in interest) since w · d = −0.01 ≤ b.
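The arithmetic can be checked with a few lines of Python. The weights for rate, discount, dlrs and world are the ones used in the slide's computation; the weight 0.70 for prime is inferred from the −0.01 score and is therefore an assumption.

# 0/1 vector representation: each distinct document term contributes its weight once.
weights = {"rate": 0.67, "discount": 0.46, "dlrs": -0.71, "world": -0.35,
           "prime": 0.70}   # 'prime' weight inferred, not given explicitly on the slide
b = 0.0

def score(doc, w):
    return sum(w.get(term, 0.0) for term in set(doc.split()))

print(score("rate discount dlrs world", weights))  # 0.07 > b  -> assign to 'interest'
print(score("prime dlrs", weights))                # -0.01 <= b -> complement class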


Rocchio separators are linear classifiers: the decision can be expressed as comparing Σ_i w_i x_i to a threshold.
[Figure]


Two-class Rocchio as a linear classifier
• Line or hyperplane defined by: Σ_{i=1..M} w_i d_i = b
• For Rocchio, set:
  w = μ(c1) − μ(c2)
  b = 0.5 · (|μ(c1)|² − |μ(c2)|²)
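A compact sketch of this construction, using dense feature lists for brevity (the helper names are illustrative assumptions, not from the lecture):

def centroid(docs):
    # docs: list of equal-length feature lists; returns their component-wise mean mu.
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def rocchio_separator(docs_c1, docs_c2):
    mu1, mu2 = centroid(docs_c1), centroid(docs_c2)
    w = [a - b for a, b in zip(mu1, mu2)]                 # w = mu(c1) - mu(c2)
    b = 0.5 * (sum(a * a for a in mu1) - sum(c * c for c in mu2))   # b = 0.5(|mu(c1)|^2 - |mu(c2)|^2)
    return w, b

def classify_c1(x, w, b):
    # Assign to c1 iff w . x > b, i.e. x is closer to mu(c1) than to mu(c2).
    return sum(wi * xi for wi, xi in zip(w, x)) > b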


Which Hyperplane?
In general, there are lots of possible solutions for a, b, c.


Which hyperplane?
• For linearly separable training sets: there are infinitely many separating hyperplanes.
• They all separate the training set perfectly . . .
• . . . but they behave differently on test data.
• Error rates on new data are low for some, high for others.


Linear classifiers: Discussion
• Many common text classifiers are linear classifiers: Naïve Bayes, Rocchio, logistic regression, linear support vector machines, etc.
• Each method has a different way of selecting the separating hyperplane – huge differences in performance.
• Can we get better performance with more powerful nonlinear classifiers?
• Not in general: a given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.


A linear problem with noise
• 3 noise documents
[Figure]


A nonlinear problem
• A linear classifier like Rocchio does badly on this task.
• kNN will do very well (assuming enough training data).


Which classifier do I use for a given TC problem?
• Is there a learning method that is optimal for all text classification problems?
• No, because there is a tradeoff between bias and variance.
• Factors to take into account:
  • How much training data is available?
  • How simple/complex is the problem? (linear vs. nonlinear decision boundary)
  • How noisy is the problem?
  • How stable is the problem over time?
    • For an unstable problem, it's better to use a simple and robust classifier.


Bias vs. variance: Choosing the correct model capacity (Sec. 14.6)


How to combine hyperplanes for > 2 classes?
[Figure]


Any-of vs. one-of problems (Sec. 14.5)
• Any-of or multivalue or multilabel classification
  • A document can belong to 0, 1, or >1 classes.
  • Classes are independent of each other. A decision on one class leaves all decisions open on other classes.
  • Example: topic classification
• One-of or multiclass or polytomous classification
  • Classes are mutually exclusive.
  • Each document belongs to exactly one class.
  • Example: language of a document (assumption: no document contains multiple languages)


Any-of classification
• Simply run each two-class classifier separately on the test document and assign the document accordingly.
• Done.
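A minimal sketch, assuming each two-class classifier is represented as a boolean-valued function of the document (the function name and dict layout are assumptions):

def any_of_classify(doc, binary_classifiers):
    # binary_classifiers: dict mapping class name -> function(doc) -> bool.
    # The document receives every class whose classifier accepts it (possibly none).
    return {c for c, clf in binary_classifiers.items() if clf(doc)}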


One-of classification
• Run each two-class classifier separately.
• Assign the document to the class with:
  • maximum score
  • maximum confidence
  • maximum probability
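A minimal sketch, assuming each classifier exposes a real-valued, mutually comparable score (names are illustrative):

def one_of_classify(doc, scorers):
    # scorers: dict mapping class name -> function(doc) -> real-valued score.
    # Pick the single class with the maximum score.
    return max(scorers, key=lambda c: scorers[c](doc))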


Summary: Representation of Text Categorization Attributes
• Representations of text are usually very high dimensional (one feature for each word).
• High-bias algorithms that prevent overfitting in high-dimensional space generally work best.
• For most text categorization tasks, there are many relevant features and many irrelevant ones.
• Methods that combine evidence from many or all features (e.g. Naive Bayes, kNN, neural nets) often tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction).*
  * Although the results are a bit more mixed than often thought.


References
• Chapter 14 in IIR.
• General overview of text classification: Sebastiani (2002)
• Text classification chapter on decision trees and perceptrons: Manning & Schütze (1999)
• One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003)