Lecture 3: Modeling


Ranking
• Central problem of IR
  – Predict which documents are relevant and which are not
• Ranking
  – Establish an ordering of the documents retrieved
• IR models
  – Different models provide distinct sets of premises to deal with document relevance


Information Retrieval Models
• Classic models
  – Boolean model
    • set theoretic
    • documents and queries are represented as sets of index terms
    • compares Boolean query statements with the term sets used to identify document content
  – Vector model
    • algebraic
    • documents and queries are represented as vectors in a t-dimensional space
    • computes global similarities between queries and documents
  – Probabilistic model
    • probabilistic
    • documents and queries are represented on the basis of probability theory
    • computes the relevance probabilities for the documents of a collection


Information Retrieval Models (Continued)
• Structured models
  – refer to the structure present in written text
  – non-overlapping lists model
  – proximal nodes model
• Browsing
  – flat
  – structure guided
  – hypertext


Taxonomy of Information Retrieval Models
• User task
  – Retrieval: ad hoc, filtering
  – Browsing
• Classic models: Boolean, vector, probabilistic
  – Set theoretic extensions: fuzzy, extended Boolean
  – Algebraic extensions: generalized vector, latent semantic indexing, neural network
  – Probabilistic extensions: inference network, belief network
• Structured models: non-overlapped lists, proximal nodes
• Browsing models: flat, structure guided, hypertext


Issues of a retrieval system
• Models
  – Boolean
  – vector
  – probabilistic
• Logical views of documents
  – full text
  – set of index terms
• User task
  – retrieval
  – browsing


Combinations of these issues

                       LOGICAL VIEW OF DOCUMENTS
  USER TASK    Index Terms      Full Text        Full Text + Structure
  Retrieval    Classic          Classic          Structured
               Set Theoretic    Set Theoretic
               Algebraic        Algebraic
               Probabilistic    Probabilistic
  Browsing     Flat             Flat             Structure Guided
                                Hypertext        Hypertext


Retrieval: Ad hoc and Filtering
• Ad hoc retrieval
  – Documents remain relatively static while new queries are submitted
• Filtering
  – Queries remain relatively static while new documents come into the system
    • e.g., news wire services in the stock market
  – A user profile describes the user's preferences
    • The filtering task indicates to the user which documents might be of interest
    • Deciding which ones are really relevant is fully reserved to the user
  – Routing: a variation of filtering
    • Rank the filtered documents and show this ranking to the user


User profile
• Simplistic approach
  – The profile is described through a set of keywords
  – The user provides the necessary keywords
• Elaborate approach
  – Collect information from the user
  – initial profile + relevance feedback (relevant and non-relevant information)


Formal Definition of IR Models
• [D, Q, F, R(q_i, d_j)]
  – D: a set composed of logical views (or representations) of the documents in the collection
  – Q: a set composed of logical views (or representations) of the user information needs (queries)
  – F: a framework for modeling document representations, queries, and their relationships
  – R(q_i, d_j): a ranking function which associates a real number with each pair q_i ∈ Q and d_j ∈ D


Formal Definition of IR Models (Continued)
• classic Boolean model
  – sets of documents
  – standard operations on sets
• classic vector model
  – t-dimensional vector space
  – standard linear algebra operations on vectors
• classic probabilistic model
  – sets
  – standard probabilistic operations and Bayes' theorem


Basic Concepts of Classic IR
• index terms (usually nouns): index and summarize
• weights of index terms
• Definition
  – K = {k_1, …, k_t}: the set of all index terms
  – w_{i,j}: the weight of index term k_i in document d_j
  – d_j = (w_{1,j}, w_{2,j}, …, w_{t,j}): the index term vector for document d_j
  – g_i(d_j) = w_{i,j}
• assumption
  – index term weights are mutually independent: w_{i,j} associated with (k_i, d_j) tells us nothing about w_{i+1,j} associated with (k_{i+1}, d_j)
  – this is a simplification: e.g., the terms computer and network in the area of computer networks are clearly correlated


Boolean Model
• The index term weight variables are all binary, i.e., w_{i,j} ∈ {0,1}
• A query q is a Boolean expression (and, or, not)
• q_dnf: the disjunctive normal form of q
• q_cc: any of the conjunctive components of q_dnf
• sim(d_j, q): the similarity of d_j to q
  – 1 if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc)); then d_j is relevant to q
  – 0 otherwise


Boolean Model (Continued)
• Example
  – q = k_a ∧ (k_b ∨ ¬k_c)
  – q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0):
    k_a ∧ (k_b ∨ ¬k_c)
    = (k_a ∧ k_b) ∨ (k_a ∧ ¬k_c)
    = (k_a ∧ k_b ∧ k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ ¬k_b ∧ ¬k_c)
    = (k_a ∧ k_b ∧ k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ ¬k_b ∧ ¬k_c)
  – (The Venn diagram in the original slide marks the three conjunctive components (1,1,1), (1,1,0), and (1,0,0) over the sets k_a, k_b, k_c.)
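This matching rule can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's notation: the three-term vocabulary and sample documents are made up, and a query is pre-converted to DNF by hand.

    # Boolean retrieval over a hand-built DNF query (illustrative sketch).

    def g(doc_terms, term):
        """Binary weight g_i(d_j): 1 if the index term occurs in the document."""
        return 1 if term in doc_terms else 0

    def sim(doc_terms, q_dnf, vocabulary):
        """sim(d_j, q) = 1 iff some conjunctive component of q_dnf matches exactly.

        q_dnf is a list of dicts mapping each index term to its required
        binary weight, e.g. {"ka": 1, "kb": 1, "kc": 0}."""
        for q_cc in q_dnf:
            if all(g(doc_terms, k) == q_cc[k] for k in vocabulary):
                return 1
        return 0

    # q = ka AND (kb OR NOT kc)  =>  q_dnf = (1,1,1) OR (1,1,0) OR (1,0,0)
    vocabulary = ["ka", "kb", "kc"]
    q_dnf = [{"ka": 1, "kb": 1, "kc": 1},
             {"ka": 1, "kb": 1, "kc": 0},
             {"ka": 1, "kb": 0, "kc": 0}]

    print(sim({"ka", "kb"}, q_dnf, vocabulary))  # 1: matches (1,1,0)
    print(sim({"kb"}, q_dnf, vocabulary))        # 0: d_j = (0,1,0) does not match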


Boolean Model (Continued)
• advantage: simple
• disadvantages
  – binary decision (relevant or non-relevant) without a grading scale
  – exact match (no partial match)
    • e.g., d_j = (0,1,0) is non-relevant to q = k_a ∧ (k_b ∨ ¬k_c)
  – tends to retrieve too few or too many documents


Basic Vector Space Model
• Term vector representation of documents D_i = (a_{i1}, a_{i2}, …, a_{it}) and queries Q_j = (q_{j1}, q_{j2}, …, q_{jt})
• t distinct terms are used to characterize content
• Each term is identified with a term vector T_i
• The t term vectors are assumed linearly independent
• Any vector is represented as a linear combination of the t term vectors
• The r-th document D_r can be represented as a document vector:

  D_r = Σ_{i=1}^{t} a_{ri} T_i


Document representation in vector space
(Figure in the original slide: a document vector in a two-dimensional vector space.)


Similarity Measure
• measured by the inner product of two vectors: x • y = |x| |y| cos α
• document-query similarity
  – document vector: D_r = Σ_{i=1}^{t} a_{ri} T_i
  – query vector: Q_s = Σ_{j=1}^{t} q_{sj} T_j
  – D_r • Q_s = Σ_{i,j=1}^{t} a_{ri} q_{sj} (T_i • T_j)
• how to determine the vector components and term correlations?


Similarity Measure (Continued)
• vector components

            T_1   T_2   …   T_t
      D_1   a_11  a_12  …   a_1t
  A = D_2   a_21  a_22  …   a_2t
      …     …     …     …   …
      D_n   a_n1  a_n2  …   a_nt


Similarity Measure (Continued)
• term correlations T_i • T_j are not available
  – assumption: term vectors are orthogonal: T_i • T_j = 0 (i ≠ j) and T_i • T_j = 1 (i = j)
• Assume that terms are uncorrelated
  – similarity between a document and a query:

    sim(D_r, Q_s) = Σ_{i=1}^{t} a_{ri} q_{si}

  – similarity between two documents:

    sim(D_r, D_s) = Σ_{i=1}^{t} a_{ri} a_{si}


Sample query-document similarity computation
• D_1 = 2T_1 + 3T_2 + 5T_3, D_2 = 3T_1 + 7T_2 + 1T_3, Q = 0T_1 + 0T_2 + 2T_3
• similarity computations for uncorrelated terms
  – sim(D_1, Q) = 2·0 + 3·0 + 5·2 = 10
  – sim(D_2, Q) = 3·0 + 7·0 + 1·2 = 2
• D_1 is preferred


Sample query-document similarity computation (Continued)
• term correlation matrix

         T_1   T_2    T_3
   T_1   1     0.5    0
   T_2   0.5   1     -0.2
   T_3   0    -0.2    1

• similarity computations for correlated terms
  – sim(D_1, Q) = (2T_1 + 3T_2 + 5T_3) • (0T_1 + 0T_2 + 2T_3)
                = 4 T_1•T_3 + 6 T_2•T_3 + 10 T_3•T_3
                = -6·0.2 + 10·1 = 8.8
  – sim(D_2, Q) = (3T_1 + 7T_2 + 1T_3) • (0T_1 + 0T_2 + 2T_3)
                = 6 T_1•T_3 + 14 T_2•T_3 + 2 T_3•T_3
                = -14·0.2 + 2·1 = -0.8
• D_1 is preferred
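Both computations can be reproduced with NumPy; a sketch, with the vectors and correlation matrix taken straight from the example above:

    import numpy as np

    D1 = np.array([2.0, 3.0, 5.0])
    D2 = np.array([3.0, 7.0, 1.0])
    Q  = np.array([0.0, 0.0, 2.0])

    # Uncorrelated terms: T is the identity, so sim reduces to a dot product.
    print(D1 @ Q, D2 @ Q)            # 10.0 2.0

    # Correlated terms: T holds the pairwise term correlations T_i . T_j.
    T = np.array([[1.0,  0.5,  0.0],
                  [0.5,  1.0, -0.2],
                  [0.0, -0.2,  1.0]])
    print(D1 @ T @ Q, D2 @ T @ Q)    # 8.8 -0.8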


Vector Model
• w_{i,j}: a positive, non-binary weight for the pair (k_i, d_j)
• w_{i,q}: a positive, non-binary weight for the pair (k_i, q)
• q = (w_{1,q}, w_{2,q}, …, w_{t,q}): the query vector, where t is the total number of index terms in the system
• d_j = (w_{1,j}, w_{2,j}, …, w_{t,j}): the document vector


Similarity of document d_j w.r.t. query q
• The correlation between vectors d_j and q:

  sim(d_j, q) = (d_j • q) / (|d_j| × |q|) = cos θ
              = Σ_{i=1}^{t} w_{i,j} × w_{i,q} / ( sqrt(Σ_{i=1}^{t} w_{i,j}²) × sqrt(Σ_{i=1}^{t} w_{i,q}²) )

• |q| does not affect the ranking
• |d_j| provides a normalization
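A minimal cosine-similarity sketch for the vector model (plain Python; the sample vectors reuse the earlier example):

    import math

    def cosine(d, q):
        """sim(d_j, q) = (d_j . q) / (|d_j| |q|) for equal-length weight vectors."""
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    print(cosine([2, 3, 5], [0, 0, 2]))  # ~0.811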


document ranking
• Similarity (i.e., sim(q, d_j)) varies from 0 to 1
• Retrieve the documents with a degree of similarity above a predefined threshold (allows partial matching)


term weighting techniques
• IR problem: one of clustering
  – user query: a specification of a set A of objects
  – clustering problem: determine which documents are in the set A (relevant) and which are not (non-relevant)
  – intra-cluster similarity
    • which features better describe the objects in the set A?
    • tf factor in the vector model: the raw frequency of a term k_i inside a document d_j
  – inter-cluster dissimilarity
    • which features better distinguish the objects in the set A from the remaining objects in the collection C?
    • idf factor (inverse document frequency) in the vector model: the inverse of the frequency of a term k_i among the documents in the collection


Definition of tf
• N: the total number of documents in the system
• n_i: the number of documents in which the index term k_i appears
• freq_{i,j}: the raw frequency of term k_i in document d_j
• f_{i,j}: the normalized frequency (0~1) of term k_i in document d_j:

  f_{i,j} = freq_{i,j} / max_l freq_{l,j}

  where the maximum is taken over the term k_l with maximum frequency in document d_j


Definition of idf and the tf-idf scheme
• idf_i: inverse document frequency for k_i:

  idf_i = log(N / n_i)

• w_{i,j}: term weighting by the tf-idf scheme:

  w_{i,j} = f_{i,j} × log(N / n_i)

• query term weight (Salton and Buckley), treating the query as a very short document:

  w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × log(N / n_i)

  where freq_{i,q} is the raw frequency of the term k_i in q
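A short tf-idf sketch following these definitions, using the gold/silver/truck collection that appears later in these slides (tokenization is naive whitespace splitting, purely for illustration; log base 10 matches the slides' idf values):

    import math
    from collections import Counter

    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]

    N = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(term for doc in tokenized for term in set(doc))  # n_i per term

    def tf_idf(doc_tokens):
        freq = Counter(doc_tokens)
        max_freq = max(freq.values())
        return {t: (f / max_freq) * math.log10(N / df[t]) for t, f in freq.items()}

    print(tf_idf(tokenized[1])["silver"])  # tf = 1.0, idf = log(3/1) ~ 0.477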


Analysis of the vector model
• advantages
  – its term-weighting scheme improves retrieval performance
  – its partial matching strategy allows retrieval of documents that approximate the query conditions
  – its cosine ranking formula sorts the documents according to their degree of similarity to the query
• disadvantage
  – index terms are assumed to be mutually independent


Probabilistic Model
• Given a query, there is an ideal answer set
  – a set of documents which contains exactly the relevant documents and no others
• query process
  – a process of specifying the properties of an ideal answer set
• problem: what are the properties?


Probabilistic Model (Continued)
• Generate a preliminary probabilistic description of the ideal answer set
• Initiate an interaction with the user
  – The user looks at the retrieved documents and decides which ones are relevant and which ones are not
  – The system uses this information to refine the description of the ideal answer set
  – Repeat the process many times


Probabilistic Principle
• Given a user query q and a document d_j in the collection, the probabilistic model estimates the probability that the user will find d_j relevant
• assumptions
  – The probability of relevance depends on the query and document representations only
  – There is a subset of all documents which the user prefers as the answer set for the query q
• Given a query, the probabilistic model assigns to each document d_j a measure of its similarity to the query:

  sim(d_j, q) = P(d_j relevant to q) / P(d_j non-relevant to q)


Probabilistic Principle (Continued)
• w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}: the index term weight variables are all binary
• q: a query, which is a subset of index terms
• R: the set of documents known to be relevant
• R̄ (the complement of R): the set of non-relevant documents
• P(R|d_j): the probability that the document d_j is relevant to the query q
• P(R̄|d_j): the probability that d_j is non-relevant to q


similarity
• sim(d_j, q): the similarity of the document d_j to the query q

  sim(d_j, q) = P(R | d_j) / P(R̄ | d_j)                             (by definition)

  By Bayes' rule, P(X | Y) = P(X) P(Y | X) / P(Y), so

  sim(d_j, q) = ( P(d_j | R) × P(R) ) / ( P(d_j | R̄) × P(R̄) )

  sim(d_j, q) ≈ P(d_j | R) / P(d_j | R̄)    (P(R) and P(R̄) are the same for all documents)

  – P(d_j | R): the probability of randomly selecting the document d_j from the set R of relevant documents
  – P(R): the probability that a document randomly selected from the entire collection is relevant


independence assumption of index terms

  sim(d_j, q) ≈ P(d_j | R) / P(d_j | R̄)

Under the independence assumption,

  P(d_j | R) = ( Π_{g_i(d_j)=1} P(k_i | R) ) × ( Π_{g_i(d_j)=0} P(¬k_i | R) )

and likewise for R̄, so

  sim(d_j, q) ≈ [ Π_{g_i(d_j)=1} P(k_i | R) × Π_{g_i(d_j)=0} P(¬k_i | R) ] / [ Π_{g_i(d_j)=1} P(k_i | R̄) × Π_{g_i(d_j)=0} P(¬k_i | R̄) ]

Taking logarithms, using P(k_i | R) + P(¬k_i | R) = 1 (and likewise for R̄), and dropping factors that are the same for all documents:

  sim(d_j, q) ~ Σ_{i=1}^{t} g_i(q) × g_i(d_j) × ( log[ P(k_i | R) / (1 − P(k_i | R)) ] + log[ (1 − P(k_i | R̄)) / P(k_i | R̄) ] )


independence assumption of index terms (Continued)

  sim(d_j, q) ≈ log( P(d_j | R) / P(d_j | R̄) )
              = Σ_{i=1}^{t} g_i(q) × g_i(d_j) × log[ P(k_i | R) (1 − P(k_i | R̄)) / ( (1 − P(k_i | R)) P(k_i | R̄) ) ]
              = Σ_{i=1}^{t} g_i(q) × g_i(d_j) × ( log[ P(k_i | R) / (1 − P(k_i | R)) ] + log[ (1 − P(k_i | R̄)) / P(k_i | R̄) ] )

• Problem: where is the set R?


Initial guess
• P(k_i | R) is constant for all index terms k_i:

  P(k_i | R) = 0.5

• The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection (assuming N >> |R|, so N ≈ |R̄|):

  P(k_i | R̄) = n_i / N


Initial ranking
• V: the subset of documents initially retrieved and ranked by the probabilistic model (the top r documents)
• V_i: the subset of V composed of documents which contain the index term k_i
• Approximate P(k_i | R) by the distribution of the index term k_i among the documents retrieved so far:

  P(k_i | R) = V_i / V

• Approximate P(k_i | R̄) by considering that all the non-retrieved documents are non-relevant:

  P(k_i | R̄) = (n_i − V_i) / (N − V)


Small values of V and V_i
• a problem arises when V = 1 and V_i = 0
• alternative 1: add a 0.5 adjustment factor

  P(k_i | R) = (V_i + 0.5) / (V + 1)
  P(k_i | R̄) = (n_i − V_i + 0.5) / (N − V + 1)

• alternative 2: use n_i/N as the adjustment factor

  P(k_i | R) = (V_i + n_i/N) / (V + 1)
  P(k_i | R̄) = (n_i − V_i + n_i/N) / (N − V + 1)
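A sketch of the per-term weight in the ranking formula above, with the 0.5 smoothing of alternative 1 (toy numbers; with V = V_i = 0 it reduces to the initial guess P(k_i|R) = 0.5):

    import math

    def term_weight(V_i, V, n_i, N):
        """log[P/(1-P)] + log[(1-P')/P'] with the smoothed estimates."""
        p_rel = (V_i + 0.5) / (V + 1)            # P(k_i | R)
        p_non = (n_i - V_i + 0.5) / (N - V + 1)  # P(k_i | R-bar)
        return (math.log(p_rel / (1 - p_rel)) +
                math.log((1 - p_non) / p_non))

    # Initial ranking (V = V_i = 0): weight of a term occurring in 1 of 3
    # documents, e.g., "silver" in the example collection -> ~0.51
    print(term_weight(0, 0, 1, 3))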


Probabilistic Model: Example
• Q: "gold silver truck"
  – D1: "Shipment of gold damaged in a fire"
  – D2: "Delivery of silver arrived in a silver truck"
  – D3: "Shipment of gold arrived in a truck"
• IDF (select keywords)
  – a = in = of = log 3/3 = 0
  – arrived = gold = shipment = truck = log 3/2 = 0.176
  – damaged = delivery = fire = silver = log 3/1 = 0.477
• 8 keywords (dimensions) are selected
  – arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)


Probabilistic Model (Continued)
• Initial guess
  (The table of initial estimates for the eight keywords was a figure in the original and is omitted here.)


Probabilistic Model (Continued)
• Interaction with the user?
  – relevance feedback
• How many documents need to be retrieved?


No Interaction with User
• Retrieve 1 document: d2 (relevant)


No Interaction with User
• Retrieve 2 documents: d2 (relevant) & d1


No Interaction with User
• Retrieve 3 documents: d2, d1 (non-relevant) & d3
• We need to interact with the user.


Interaction with User
• Retrieve 2 documents: d2 & d1 (non-relevant)
• (The original slides show the estimates P(k_i|R) and P(k_i|R̄) recomputed from this feedback with the 0.5 adjustment factors over R, r, n_i, and N, the corresponding tables for alternatives 2, 3, and 4, and the resulting rankings; these tables were figures and are omitted here.)


Analysis of the Probabilistic Model
• advantage
  – documents are ranked in decreasing order of their probability of being relevant
• disadvantages
  – the need to guess the initial separation of documents into relevant and non-relevant sets
  – does not take into account the frequency with which an index term occurs inside a document
  – the independence assumption for index terms


Comparison of classic models
• Boolean model: the weakest classic model
• The vector model is expected to outperform the probabilistic model on general collections (Salton and Buckley)


Alternative Set Theoretic Models: Fuzzy Set Model
• Model
  – a query term: a fuzzy set
  – a document: has a degree of membership in this set
  – membership function
    • associates a membership value with each element (document) of the class
    • 0: no membership in the set
    • 1: full membership
    • 0~1: marginal elements of the set


Fuzzy Set Theory
• A fuzzy subset A of a universe of discourse U (a class) is characterized by a membership function μ_A: U → [0,1] which associates with each element u of U (e.g., a document) a number μ_A(u) in the interval [0,1]
  – complement: μ_Ā(u) = 1 − μ_A(u)
  – union: μ_{A∪B}(u) = max(μ_A(u), μ_B(u))
  – intersection: μ_{A∩B}(u) = min(μ_A(u), μ_B(u))


Examples
• Assume U = {d_1, d_2, d_3, d_4, d_5, d_6}
• Let A and B be {d_1, d_2, d_3} and {d_2, d_3, d_4}, respectively
• Assume μ_A = {d_1: 0.8, d_2: 0.7, d_3: 0.6, d_4: 0, d_5: 0, d_6: 0} and μ_B = {d_1: 0, d_2: 0.6, d_3: 0.8, d_4: 0.9, d_5: 0, d_6: 0}
• complement: μ_Ā(u) = 1 − μ_A(u) = {d_1: 0.2, d_2: 0.3, d_3: 0.4, d_4: 1, d_5: 1, d_6: 1}
• union: μ_{A∪B}(u) = max(μ_A(u), μ_B(u)) = {d_1: 0.8, d_2: 0.7, d_3: 0.8, d_4: 0.9, d_5: 0, d_6: 0}
• intersection: μ_{A∩B}(u) = min(μ_A(u), μ_B(u)) = {d_1: 0, d_2: 0.6, d_3: 0.6, d_4: 0, d_5: 0, d_6: 0}
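A tiny sketch of these fuzzy set operations on membership dicts, reproducing the example above:

    mu_A = {"d1": 0.8, "d2": 0.7, "d3": 0.6, "d4": 0.0, "d5": 0.0, "d6": 0.0}
    mu_B = {"d1": 0.0, "d2": 0.6, "d3": 0.8, "d4": 0.9, "d5": 0.0, "d6": 0.0}

    complement   = {u: 1 - m for u, m in mu_A.items()}
    union        = {u: max(mu_A[u], mu_B[u]) for u in mu_A}
    intersection = {u: min(mu_A[u], mu_B[u]) for u in mu_A}

    print(union["d3"], intersection["d3"])  # 0.8 0.6, matching the example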


Fuzzy AND, Fuzzy OR, Fuzzy NOT
(The original slides illustrate the fuzzy intersection (min), union (max), and complement operations graphically; the figures are omitted here.)


Fuzzy Information Retrieval
• basic idea
  – Expand the set of index terms in the query with related terms (from a thesaurus) such that additional relevant documents can be retrieved
  – A thesaurus can be constructed by defining a term-term correlation matrix c (a keyword connection matrix) whose rows and columns are associated with the index terms in the document collection


Fuzzy Information Retrieval (Continued)
• normalized correlation factor c_{i,l} (0~1) between two terms k_i and k_l:

  c_{i,l} = n_{i,l} / (n_i + n_l − n_{i,l})

  where
  – n_i is the number of documents containing term k_i
  – n_l is the number of documents containing term k_l
  – n_{i,l} is the number of documents containing both k_i and k_l
• In the fuzzy set associated with each index term k_i, a document d_j has a degree of membership μ_{i,j}:

  μ_{i,j} = 1 − Π_{k_l ∈ d_j} (1 − c_{i,l})
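A sketch of both formulas (the correlation values in the demo are hypothetical, chosen only to show the behavior):

    def correlation(n_i, n_l, n_il):
        """c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l})."""
        return n_il / (n_i + n_l - n_il)

    def membership(query_term, doc_terms, c):
        """mu_{i,j} = 1 - prod over terms k_l in d_j of (1 - c[(i, l)])."""
        prod = 1.0
        for k_l in doc_terms:
            prod *= 1.0 - c.get((query_term, k_l), 0.0)
        return 1.0 - prod

    # Hypothetical correlations for illustration:
    c = {("gold", "gold"): 1.0, ("gold", "silver"): 0.2}
    print(membership("gold", ["gold", "silver"], c))  # 1.0 (d_j contains k_i itself)
    print(membership("gold", ["silver"], c))          # 0.2 (only a related term)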


Fuzzy Information Retrieval (Continued)
• physical meaning
  – A document d_j belongs to the fuzzy set associated with the term k_i if its own terms are related to k_i; if d_j contains k_i itself, μ_{i,j} = 1
  – If there is at least one index term k_l of d_j which is strongly related to the index term k_i, then μ_{i,j} ≈ 1, and k_i is a good fuzzy index for d_j
  – When all index terms of d_j are only loosely related to k_i, μ_{i,j} ≈ 0, and k_i is not a good fuzzy index for d_j


Example
• q = k_a ∧ (k_b ∨ ¬k_c) = (k_a ∧ k_b ∧ k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ ¬k_b ∧ ¬k_c) = cc_1 + cc_2 + cc_3
• D_a: the fuzzy set of documents associated with the index term k_a
  – d_j ∈ D_a has a degree of membership μ_{a,j} greater than a predefined threshold K
• D̄_a: the fuzzy set of documents associated with ¬k_a (the negation of the index term k_a)
• (The Venn diagram in the original slide shows the conjunctive components cc_1, cc_2, cc_3 over the sets D_a, D_b, D_c.)


Example (Continued)
• Query q = k_a ∧ (k_b ∨ ¬k_c); disjunctive normal form q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
• (1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of the max function), which varies more smoothly
• (2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of the min function), which varies more smoothly

  μ_{q,j} = μ_{cc1+cc2+cc3, j}
          = 1 − Π_{i=1}^{3} (1 − μ_{cci, j})
          = 1 − (1 − μ_{a,j} μ_{b,j} μ_{c,j}) × (1 − μ_{a,j} μ_{b,j} (1 − μ_{c,j})) × (1 − μ_{a,j} (1 − μ_{b,j}) (1 − μ_{c,j}))

  Recall μ_Ā(u) = 1 − μ_A(u)
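A sketch of this query membership computation (the sample membership values are made up for illustration):

    def mu_query(mu_a, mu_b, mu_c):
        """mu_{q,j} for q = ka AND (kb OR NOT kc), via algebraic sum/product."""
        cc1 = mu_a * mu_b * mu_c              # component (1,1,1)
        cc2 = mu_a * mu_b * (1 - mu_c)        # component (1,1,0)
        cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # component (1,0,0)
        # Algebraic sum of the three conjunctive components:
        return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

    print(mu_query(0.9, 0.5, 0.2))  # ~0.63: strongly in ka, marginal in kb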


Fuzzy Set Model: Example
• Q: "gold silver truck"
  – D1: "Shipment of gold damaged in a fire"
  – D2: "Delivery of silver arrived in a silver truck"
  – D3: "Shipment of gold arrived in a truck"
• IDF (select keywords)
  – a = in = of = log 3/3 = 0
  – arrived = gold = shipment = truck = log 3/2 = 0.176
  – damaged = delivery = fire = silver = log 3/1 = 0.477
• 8 keywords (dimensions) are selected
  – arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)


Fuzzy Set Model (Continued)
(The intermediate tables for this example were figures in the original and are omitted here.)


Fuzzy Set Model (Continued)
• Sim(q,d), alternative 1: Sim(q,d_3) > Sim(q,d_2) > Sim(q,d_1)
• Sim(q,d), alternative 2: Sim(q,d_3) > Sim(q,d_2) > Sim(q,d_1)


Alternative Algebraic Model: Generalized Vector Space Model
• independence of index terms
  – k_i: a vector associated with the index term k_i
  – the set of vectors {k_1, k_2, …, k_t} is linearly independent
  – orthogonality: k_i • k_j = 0 for i ≠ j
• Theorem: if the nonzero vectors k_1, k_2, …, k_n are orthogonal, then they are linearly independent
• In the generalized vector space model, the index term vectors are assumed linearly independent but are not pairwise orthogonal
• The index term vectors, which are not seen as the basis of the space, are composed of smaller components derived from the particular collection


Review
• Two vectors u and v are linearly independent
  – if αu + βv = 0, then α = β = 0
• Two vectors u and v are orthogonal (i.e., θ = 90°)
  – u • v = 0 (i.e., u^T v = 0)
• If two nonzero vectors u and v are orthogonal, then u and v are linearly independent
  – assume αu + βv = 0, with u ≠ 0 and v ≠ 0
  – u^T(αu + βv) = 0 → α u^T u + β u^T v = 0 → α u^T u = 0 → α = 0, and similarly β = 0


Generalized Vector Space Model
• {k_1, k_2, …, k_t}: index terms in a collection
• w_{i,j}: binary weights associated with the term-document pair (k_i, d_j)
• The patterns of term co-occurrence (inside documents) can be represented by a set of 2^t minterms
  – m_1 = (0, 0, …, 0): points to documents containing none of the index terms
  – m_2 = (1, 0, …, 0): points to documents containing the index term k_1 only
  – m_3 = (0, 1, …, 0): points to documents containing the index term k_2 only
  – m_4 = (1, 1, …, 0): points to documents containing the index terms k_1 and k_2
  – …
  – m_{2^t} = (1, 1, …, 1): points to documents containing all the index terms
• g_i(m_j): returns the weight {0,1} of the index term k_i in the minterm m_j (1 ≤ i ≤ t)


Generalized Vector Space Model (Continued)
• With each minterm m_i (a t-tuple) we associate a 2^t-tuple vector m_i:

  m_1 = (1, 0, …, 0, 0)
  m_2 = (0, 1, …, 0, 0)
  …
  m_{2^t} = (0, 0, …, 0, 1)

• m_i • m_j = 0 for i ≠ j (the set of vectors m_i are pairwise orthogonal)
• e.g., the vector m_4 is associated with the minterm m_4, containing k_1 and k_2 and no others
• co-occurrence of index terms inside documents induces dependencies among index terms


Example with t = 3

Document collection (index terms per document):
  d1 (k1)      d11 (k1 k2)
  d2 (k3)      d12 (k1 k3)
  d3 (k3)      d13 (k1 k2)
  d4 (k1)      d14 (k1 k2)
  d5 (k2)      d15 (k1 k2 k3)
  d6 (k2)      d16 (k1 k2)
  d7 (k2 k3)   d17 (k1 k2)
  d8 (k2 k3)   d18 (k1 k2)
  d9 (k2)      d19 (k1 k2 k3)
  d10 (k2 k3)  d20 (k1 k2)

Minterms and their associated vectors:
  minterm           m_r vector
  m_1 = (0,0,0)     m_1 = (1,0,0,0,0,0,0,0)
  m_2 = (0,0,1)     m_2 = (0,1,0,0,0,0,0,0)
  m_3 = (0,1,0)     m_3 = (0,0,1,0,0,0,0,0)
  m_4 = (0,1,1)     m_4 = (0,0,0,1,0,0,0,0)
  m_5 = (1,0,0)     m_5 = (0,0,0,0,1,0,0,0)
  m_6 = (1,0,1)     m_6 = (0,0,0,0,0,1,0,0)
  m_7 = (1,1,0)     m_7 = (0,0,0,0,0,0,1,0)
  m_8 = (1,1,1)     m_8 = (0,0,0,0,0,0,0,1)

Index vector for k_1 (k_1 occurs in minterms m_5, m_6, m_7, m_8):

  k_1 = (c_{1,5} m_5 + c_{1,6} m_6 + c_{1,7} m_7 + c_{1,8} m_8) / sqrt(c_{1,5}² + c_{1,6}² + c_{1,7}² + c_{1,8}²)

  c_{1,5} = w_{1,1} + w_{1,4}
  c_{1,6} = w_{1,12}
  c_{1,7} = w_{1,11} + w_{1,13} + w_{1,14} + w_{1,16} + w_{1,17} + w_{1,18} + w_{1,20}
  c_{1,8} = w_{1,15} + w_{1,19}

Index vector for k_2 (minterms m_3, m_4, m_7, m_8):

  k_2 = (c_{2,3} m_3 + c_{2,4} m_4 + c_{2,7} m_7 + c_{2,8} m_8) / sqrt(c_{2,3}² + c_{2,4}² + c_{2,7}² + c_{2,8}²)

  c_{2,3} = w_{2,5} + w_{2,6} + w_{2,9}
  c_{2,4} = w_{2,7} + w_{2,8} + w_{2,10}
  c_{2,7} = w_{2,11} + w_{2,13} + w_{2,14} + w_{2,16} + w_{2,17} + w_{2,18} + w_{2,20}
  c_{2,8} = w_{2,15} + w_{2,19}

Index vector for k_3 (minterms m_2, m_4, m_6, m_8):

  k_3 = (c_{3,2} m_2 + c_{3,4} m_4 + c_{3,6} m_6 + c_{3,8} m_8) / sqrt(c_{3,2}² + c_{3,4}² + c_{3,6}² + c_{3,8}²)

  c_{3,2} = w_{3,2} + w_{3,3}
  c_{3,4} = w_{3,7} + w_{3,8} + w_{3,10}
  c_{3,6} = w_{3,12}
  c_{3,8} = w_{3,15} + w_{3,19}


Generalized Vector Space Model (Continued)
• Determine the index vector k_i associated with the index term k_i:

  k_i = ( Σ_{∀r, g_i(m_r)=1} c_{i,r} m_r ) / sqrt( Σ_{∀r, g_i(m_r)=1} c_{i,r}² )

  – collect all the vectors m_r in which the index term k_i is in state 1
  – c_{i,r} = Σ_{d_j | g_l(d_j) = g_l(m_r) for all l} w_{i,j}: sum up the weights w_{i,j} associated with the index term k_i over all documents d_j whose term occurrence pattern coincides with minterm m_r


Generalized Vector Space Model (Continued)
• k_i • k_j quantifies the degree of correlation between k_i and k_j; up to the normalization above,

  k_i • k_j ∝ Σ_{∀r | g_i(m_r)=1 ∧ g_j(m_r)=1} c_{i,r} × c_{j,r}

• document and query vectors are built from the index vectors: d_j = Σ_i w_{i,j} k_i and q = Σ_i w_{i,q} k_i
• the standard cosine similarity is adopted for ranking
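A compact GVSM sketch with NumPy (toy binary term-document matrix, made up for illustration): it builds the index vectors k_i in minterm space and then inspects the induced term-term correlations, which are nonzero exactly when terms co-occur in some document.

    import numpy as np

    # Binary term-document matrix (rows = documents d_j, columns = terms k_i).
    W = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]])

    # Each document's row is its term-occurrence pattern (an active minterm);
    # only minterms that actually occur need a column.
    patterns = sorted({tuple(row) for row in W.tolist()})
    col = {p: idx for idx, p in enumerate(patterns)}

    t = W.shape[1]
    K = np.zeros((t, len(patterns)))      # row i will hold the vector k_i
    for j, row in enumerate(W):
        K[:, col[tuple(row)]] += row      # c_{i,r}: accumulate w_{i,j}
    K /= np.linalg.norm(K, axis=1, keepdims=True)

    print(np.round(K @ K.T, 3))           # off-diagonal entries are nonzero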


Example with t = 3 (Continued)

  k_1 = (c_{1,5} m_5 + c_{1,6} m_6 + c_{1,7} m_7 + c_{1,8} m_8) / sqrt(c_{1,5}² + c_{1,6}² + c_{1,7}² + c_{1,8}²)
  k_2 = (c_{2,3} m_3 + c_{2,4} m_4 + c_{2,7} m_7 + c_{2,8} m_8) / sqrt(c_{2,3}² + c_{2,4}² + c_{2,7}² + c_{2,8}²)
  k_3 = (c_{3,2} m_2 + c_{3,4} m_4 + c_{3,6} m_6 + c_{3,8} m_8) / sqrt(c_{3,2}² + c_{3,4}² + c_{3,6}² + c_{3,8}²)

  k_1 • k_2 = (c_{1,7} × c_{2,7} + c_{1,8} × c_{2,8}) / ( sqrt(c_{1,5}² + c_{1,6}² + c_{1,7}² + c_{1,8}²) × sqrt(c_{2,3}² + c_{2,4}² + c_{2,7}² + c_{2,8}²) )
  k_1 • k_3 = (c_{1,6} × c_{3,6} + c_{1,8} × c_{3,8}) / (analogous normalization)
  k_2 • k_3 = (c_{2,4} × c_{3,4} + c_{2,8} × c_{3,8}) / (analogous normalization)


Latent Semantic Indexing (LSI) Model
• representation of documents and queries by index terms
  – problem 1: many unrelated documents might be included in the answer set
  – problem 2: relevant documents which are not indexed by any of the query keywords are not retrieved
• possible solution: concept matching instead of index term matching
  – application in cross-language information retrieval (CLIR)


Basic idea
• Map each document and query vector into a lower-dimensional space which is associated with concepts
• Retrieval in the reduced space may be superior to retrieval in the space of index terms


Definition
• t: the number of index terms in the collection
• N: the total number of documents
• M = (M_ij): a term-document association matrix with t rows (terms) and N columns (documents)
• M_ij: a weight w_{i,j} associated with the term-document pair [k_i, d_j] (e.g., using tf-idf)


Singular Value Decomposition: Review
(1) If A ∈ R^{n×n} and A^t = A, there exists an orthogonal Q ∈ R^{n×n} (Q Q^t = Q^t Q = I) such that

  A = Q D Q^t

  where D = diag(λ_1, λ_2, …, λ_n) with λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0.
  (Check: A^t = (Q D Q^t)^t = (Q^t)^t D^t Q^t = Q D Q^t = A.)


(2) If A ∈ R^{n×n} and A^t ≠ A, there exist orthogonal U, V ∈ R^{n×n} (U^t U = I, V^t V = I) such that

  A = U D V^t   (singular value decomposition)

  Then, using (AB)^t = B^t A^t:

  A A^t = (U D V^t)(U D V^t)^t = (U D V^t)(V D U^t) = U D² U^t

  where D = diag(λ_1, λ_2, …, λ_n) with λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0.


A = Q D Q^t implies A Q = Q D Q^t Q = Q D, where Q = [q_1 q_2 … q_n] and each q_i is a column vector:

  A [q_1 q_2 … q_n] = [q_1 q_2 … q_n] diag(λ_1, …, λ_n)
  [A q_1  A q_2  …  A q_n] = [λ_1 q_1  λ_2 q_2  …  λ_n q_n]

  so A q_1 = λ_1 q_1, A q_2 = λ_2 q_2, …, A q_n = λ_n q_n.

That is, λ_1, λ_2, …, λ_n are the eigenvalues of A, and q_k is the eigenvector of A corresponding to λ_k.


Singular Value Decomposition of M
• M ∈ R^{t×N}: a term-document matrix with t rows and N columns
• M^t M: an N × N document-to-document matrix
• M M^t: a t × t term-to-term matrix
• Accordingly, M can be decomposed as

  M = K S D^t

  where
  – K: the matrix of eigenvectors derived from M M^t (K^t K = I)
  – D: the matrix of eigenvectors derived from M^t M (D^t D = I)


• Verification, comparing with A = Q D Q^t (Q the matrix of eigenvectors of A, D the diagonal matrix of eigenvalues):

  M^t M = (K S D^t)^t (K S D^t) = (D S K^t)(K S D^t) = D S² D^t   (document-to-document matrix)
  M M^t = (K S D^t)(K S D^t)^t = (K S D^t)(D S K^t) = K S² K^t   (term-to-term matrix)

  – hence D is the matrix of eigenvectors derived from M^t M, and K is the matrix of eigenvectors derived from M M^t
  – S: an r × r diagonal matrix of singular values, where r = min(t, N)
• choose s < r (the concept space is reduced)


• Consider only the s largest singular values of S (λ_1 ≥ λ_2 ≥ … ≥ λ_s), discarding the rest:

  M_s = K_s S_s D_s^t

• The resulting matrix M_s has rank s and is the rank-s matrix closest to the original matrix M in the least-squares sense
• In terms of concept clustering: if s is too large, the grouping is too fine-grained and each index term represents a distinct concept; choosing s < r reduces the concept space


Ranking in LSI
• query: a pseudo-document in the original term-document matrix M
  – the query is modeled as the document with number 0
• M_s^t M_s: the ranks of all documents w.r.t. this query

  M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t) = (D_s S_s K_s^t)(K_s S_s D_s^t) = D_s S_s S_s D_s^t = (D_s S_s)(D_s S_s)^t

• the (i,j) element of M_s^t M_s quantifies the relationship between documents d_i and d_j
• when i = 0, it denotes the similarity between q and each document
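An LSI sketch with NumPy: decompose the term-document matrix, keep the s largest singular values, and rank documents against a query folded in as pseudo-document 0, as described above. The toy matrix and weights are made up for illustration.

    import numpy as np

    # Toy term-document matrix M (t terms x N documents).
    M = np.array([[1.0, 0.0, 1.0],
                  [0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
    q = np.array([1.0, 0.0, 1.0, 0.0])   # query as a pseudo-document

    Mq = np.column_stack([q, M])          # query becomes document 0
    K, S, Dt = np.linalg.svd(Mq, full_matrices=False)

    s = 2                                 # keep the s largest singular values
    Ms = K[:, :s] @ np.diag(S[:s]) @ Dt[:s, :]

    sims = (Ms.T @ Ms)[0, 1:]             # row 0: query vs. each document
    print(np.argsort(-sims) + 1, np.round(sims, 3))  # document ranking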


Structured Text Retrieval Models
• Definition
  – combine information on text content with information on the document structure
  – e.g., same-page(near('atomic holocaust', Figure(label('earth'))))
• Expressive power vs. evaluation efficiency
  – a model based on non-overlapping lists
  – a model based on proximal nodes
• Terminology
  – match point: position in the text of a sequence of words that matches the user query
  – region: a contiguous portion of the text
  – node: a structural component of the document (chapter, section, …)


Non-Overlapping Lists
• Divide the whole text of each document into non-overlapping text regions (lists); regions in the same list cannot overlap
• Example indexing lists (one per structural level):
  – L0: a list of all chapters in the document
  – L1: a list of all sections in the document
  – L2: a list of all subsections in the document
  – L3: a list of all sub-subsections in the document
• Text regions from distinct lists might overlap (e.g., a chapter contains its sections)


Non-Overlapping Lists (Continued)
• Data structure
  – a single inverted file
  – each structural component (e.g., chapter, section, …) stands as an entry
  – for each entry, there is a list of text regions as a list of occurrences
  – recall that there is a separate inverted file for the words in the text
• Operations
  – select a region which contains a given word
  – select a region A which does not contain any other region B (where B belongs to a list distinct from the list of A)
  – select a region not contained within any other region
  – …


Inverted Files
• A file is represented as an array of indexed records.

              Term 1  Term 2  Term 3  Term 4
    Record 1  1       1       0       1
    Record 2  0       1       1       1
    Record 3  1       0       1       1
    Record 4  0       0       1       1


Inverted-file process
• The record-term array is inverted (transposed).

            Record 1  Record 2  Record 3  Record 4
    Term 1  1         0         1         0
    Term 2  1         1         0         0
    Term 3  0         1         1         1
    Term 4  1         1         1         1


Inverted-file process (Continued)
• Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers.

    Query (term2 AND term3):
      Term 2:  1 1 0 0
      Term 3:  0 1 1 1
      ------------------
      AND:     0 1 0 0   → Record 2
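A sketch of the same AND operation using Python sets of record identifiers as posting lists (the postings mirror the table above):

    postings = {
        "term1": {1, 3},
        "term2": {1, 2},
        "term3": {2, 3, 4},
        "term4": {1, 2, 3, 4},
    }

    def boolean_and(*terms):
        """Intersect the posting lists of the given terms."""
        result = postings[terms[0]]
        for t in terms[1:]:
            result = result & postings[t]
        return sorted(result)

    print(boolean_and("term2", "term3"))  # [2] -> Record 2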


Extensions of Inverted Index Operations (Distance Constraints)
• Distance constraints
  – (A within sentence B): terms A and B must co-occur in a common sentence
  – (A adjacent B): terms A and B must occur adjacently in the text


Extensions of Inverted Index Operations (Distance Constraints, Continued)
• Implementation
  – include term locations in the inverted indexes
    information: {R345, R348, R350, …}
    retrieval: {R123, R128, R345, …}
  – include sentence locations in the indexes
    information: {R345, 25; R345, 37; R348, 10; R350, 8; …}
    retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}


Extensions of Inverted Index Operations (Distance Constraints, Continued)
• Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
    information: {R345, 2, 3, 5; …}
    retrieval: {R345, 2, 3, 6; …}
• query examples
  – (information adjacent retrieval)
  – (information within five words retrieval)
• cost: the size of the indexes
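A sketch of checking such constraints with a positional index. The postings below are toy data; each term maps to {record_id: [word positions]}, and an adjacency query asks whether the second term occurs right after the first.

    positions = {
        "information": {"R345": [5, 40]},
        "retrieval":   {"R345": [6, 90]},
    }

    def within(a, b, max_dist=1):
        """Records where term b occurs within max_dist words after term a
        (max_dist=1 is the adjacency constraint)."""
        hits = []
        for rec in positions[a].keys() & positions[b].keys():
            if any(0 < pb - pa <= max_dist
                   for pa in positions[a][rec] for pb in positions[b][rec]):
                hits.append(rec)
        return hits

    print(within("information", "retrieval"))              # ['R345'] (adjacent)
    print(within("information", "retrieval", max_dist=5))  # within five words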


Model Based on Proximal Nodes
• hierarchical vs. flat indexing structures
  – hierarchical index: nodes are positions in the text, organized as chapters, sections, subsections, sub-subsections (also paragraphs, pages, lines)
  – flat index: an inverted list per word, whose entries are positions in the text; e.g., for 'holocaust': 10, 256, …, 48324


Model Based on Proximal Nodes (Continued)
• query language
  – specification of regular expressions
  – reference to structural components by name
  – combinations of both
  – Example: search for sections, subsections, or sub-subsections which contain the word 'holocaust'
    • [(*section) with ('holocaust')]


Model Based on Proximal Nodes (Continued)
• Basic algorithm
  – traverse the inverted list for the term 'holocaust'
  – for each entry in the list (i.e., an occurrence), search the hierarchical index looking for sections, subsections, and sub-subsections
• Revised algorithm (exploits the fact that consecutive entries are nearby nodes)
  – for the first entry, search as before
  – let the last matching structural component be the innermost matching component
  – verify whether the innermost matching component also matches the second entry
    • if it does, the larger structural components above it also match


Models for Browsing
• Browsing vs. searching
  – the goal of a searching task is clearer in the mind of the user than the goal of a browsing task
• Models
  – flat browsing
  – structure guided browsing
  – the hypertext model


Models for Browsing (Continued)
• Flat organization
  – documents are represented as dots in a 2-D plane, or
  – documents are represented as elements in a 1-D list, e.g., the results of a search engine
• Structure guided browsing
  – documents are organized in a directory, which groups documents covering related topics
• Hypertext model
  – navigating the hypertext: a traversal of a directed graph


Trends and Research Issues
• Library systems
  – cognitive and behavioral issues, oriented particularly at a better understanding of the criteria users adopt to judge relevance
• Specialized retrieval systems
  – e.g., legal and business documents
  – how to retrieve all relevant documents without retrieving a large number of unrelated documents
• The Web
  – the user does not know what he wants or has great difficulty in formulating his request
  – how the paradigm adopted for the user interface affects the ranking
  – the indexes maintained by the various Web search engines are almost disjoint

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!