Set-Based Model: A New Approach for Information Retrieval

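Introduction
• In the IR field, the vector space model (VSM) is popular. Its success is largely the result of the long-term efforts of Salton and his colleagues.
• In the VSM, documents and queries are both represented as weighted vectors, and the similarity computation (ranking) is based on the weights assigned to the index terms of the document and the query.
• There are many ways to compute term weights, and the question is still open. The best weighting scheme known so far is the tf × idf scheme.
• tf × idf considers two factors when computing the weight of an index term: (1) the number of times the index term occurs in the document, and (2) the number of documents in the whole collection that contain the index term.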
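As a rough illustration of the two factors above, here is a minimal sketch of one common tf × idf variant (raw term frequency times log(N / df)). The exact formula, the function name `tf_idf_weights`, and the toy documents are assumptions made for this example; the slides do not specify which variant is meant.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[t]: number of occurrences of t in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection: each document is a list of index terms.
docs = [
    ["a", "b", "a"],
    ["b", "c"],
    ["b", "d", "d"],
]
weights = tf_idf_weights(docs)
```

In this toy collection the term `b` occurs in every document, so its weight is zero everywhere, while `a` and `d` occur twice in a single document each and receive the highest weights.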

SBM Core Idea – Closed Termsets
3. S = {s_1, s_2, ..., s_(2^t)}: S is the vocabulary-set of a collection of documents D. Each document may contain several termsets s_i, since terms are counted more than once.
4. ls_i: with each termset s_i, 1 ≤ i ≤ 2^t, we associate an inverted list ls_i that stores which documents contain the termset.
   ds_i: the frequency of a termset s_i, defined as the number of occurrences of s_i in D (ds_i = |ls_i|).
   A termset s_i is a frequent termset if its frequency ds_i is greater than or equal to a given threshold.
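The definitions of ls_i, ds_i, and frequent termsets can be sketched as follows. This is a brute-force illustration only: it enumerates every termset of every document, which is feasible for toy data but not for a real collection. The four documents, the helper name `termset_inverted_lists`, and the reading of the threshold as a fraction of the collection size are assumptions for the example.

```python
import math
from itertools import combinations

def termset_inverted_lists(docs):
    """Map each termset (a frozenset) to the set of ids of documents containing it."""
    lists = {}
    for doc_id, doc in enumerate(docs):
        terms = sorted(set(doc))
        for size in range(1, len(terms) + 1):
            for combo in combinations(terms, size):
                lists.setdefault(frozenset(combo), set()).add(doc_id)
    return lists

# Hypothetical documents over the vocabulary {a, b, c, d, e}.
docs = [
    ["a", "c", "d"],
    ["b", "c", "e"],
    ["a", "b", "c", "e"],
    ["b", "e"],
]
lists = termset_inverted_lists(docs)

# A 50% threshold over 4 documents means a termset needs at least 2 documents.
min_freq = math.ceil(0.5 * len(docs))

# ds_i is the length of the inverted list ls_i.
frequent = {s: l for s, l in lists.items() if len(l) >= min_freq}
```

With these documents, nine termsets are frequent: a, b, c, e, ac, bc, be, ce, and bce.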

SBM Core Idea – Closed Termsets
5. A closed termset cs_i is a frequent termset that is the largest termset among the termsets that are subsets of cs_i and occur in the same set of documents.
6. A maximal termset ms_i is a frequent termset that is not a subset of any other frequent termset.
It has been proven that the set of maximal termsets associated with a document collection is the minimum amount of information necessary to derive all frequent termsets associated with that collection.
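To make the two definitions concrete, the sketch below checks them directly against a table of frequent termsets and their inverted lists. The table is hypothetical (four documents numbered 0 to 3 over the terms a, b, c, e, with a minimum frequency of 2), and the function names are illustrative assumptions, not part of SBM itself.

```python
def closed_termsets(frequent):
    """Frequent termsets with no proper superset occurring in exactly the same documents."""
    return {
        s for s, docs in frequent.items()
        if not any(s < other and docs == other_docs
                   for other, other_docs in frequent.items())
    }

def maximal_termsets(frequent):
    """Frequent termsets that are not a proper subset of any other frequent termset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

# Hypothetical frequent termsets mapped to their inverted lists (document ids).
frequent = {
    frozenset("a"): {0, 2},
    frozenset("b"): {1, 2, 3},
    frozenset("c"): {0, 1, 2},
    frozenset("e"): {1, 2, 3},
    frozenset("ac"): {0, 2},
    frozenset("bc"): {1, 2},
    frozenset("be"): {1, 2, 3},
    frozenset("ce"): {1, 2},
    frozenset("bce"): {1, 2},
}

closed = closed_termsets(frequent)
maximal = maximal_termsets(frequent)
```

Here the closed termsets are c, ac, be, and bce, and the maximal termsets are ac and bce. Every maximal termset is also closed, but not the other way around: c and be are closed without being maximal.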

Example
T = {a, b, c, d, e}, threshold = 50%

Algorithm Flow – Determining Closed Termsets
1. Is the frequency of a 1-termset above the given threshold? If so, mark the termset as closed and go to step 2.
2. A new (n+1)-termset s_new is generated as s_i ∪ s_j, where s_i and s_j are both n-termsets that share the same first n−1 terms. Its inverted list is l_new = l_i ∩ l_j.
3. Check whether s_new is frequent (Apriori algorithm). Principle: an n-termset may be frequent only if all of its (n−1)-termsets are also frequent.
4. If s_new is frequent, check whether it is the largest. If it is, unmark the smaller termsets as closed and mark s_new as closed; otherwise s_new is discarded.
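The four steps above can be sketched as a level-wise loop. This is one possible reading, not the authors' implementation: in step 4, "the smaller termsets" is taken to mean the n-term subsets of s_new whose inverted list equals l_new, and the threshold is taken as a minimum document count. The function name and the documents are hypothetical.

```python
from itertools import combinations

def closed_termsets_levelwise(docs, min_freq):
    """Level-wise (Apriori-style) enumeration; returns {closed termset: inverted list}."""
    # Step 1: build inverted lists of 1-termsets; the frequent ones start out closed.
    level = {}
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            level.setdefault((term,), set()).add(doc_id)
    level = {s: l for s, l in level.items() if len(l) >= min_freq}
    closed = dict(level)

    while level:
        next_level = {}
        for s_i, s_j in combinations(sorted(level), 2):
            # Step 2: join two n-termsets that share their first n-1 terms.
            if s_i[:-1] != s_j[:-1]:
                continue
            s_new = s_i + (s_j[-1],)
            # Step 3: Apriori pruning, every n-term subset must itself be frequent.
            subsets = list(combinations(s_new, len(s_new) - 1))
            if any(sub not in level for sub in subsets):
                continue
            l_new = level[s_i] & level[s_j]
            if len(l_new) < min_freq:
                continue  # s_new is not frequent and is discarded
            next_level[s_new] = l_new
            # Step 4: s_new replaces any subset that occurs in the same documents.
            for sub in subsets:
                if level[sub] == l_new:
                    closed.pop(sub, None)
            closed[s_new] = l_new
        level = next_level
    return closed

# Hypothetical documents over T = {a, b, c, d, e}; 50% of 4 documents gives min_freq = 2.
docs = [
    ["a", "c", "d"],
    ["b", "c", "e"],
    ["a", "b", "c", "e"],
    ["b", "e"],
]
closed = closed_termsets_levelwise(docs, min_freq=2)
```

Termsets are stored as sorted tuples so that the "same first n−1 terms" test in step 2 is a simple prefix comparison. On these documents the loop ends with four closed termsets: (c), (a, c), (b, e), and (b, c, e).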

Experimental Results
Main features of the three collections

Retrieval Performance - CVC collection

Retrieval Performance - WSJ collection

Retrieval Performance - TREC-3 collection

Overall average precision

Average precision of top 10 documents

Average number of closed termsets and the average list sizes while using SBM

Response time

Conclusions and future work
• SBM improves retrieval effectiveness.
• Computing the frequent termsets, enumerated by an algorithm for generating association rules, leads to a direct extension of the vector space model.
• For future work, we will extend SBM to account for proximity information about query terms in documents.
