10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee



5 Web Linkage Mining

Fig. 5.10. A virtual document is comprised of anchortexts and nearby words from pages that link to the target document

document) was converted into a set of features that occurred and then appropriate histograms were updated.

For example, if a document has the sentence: "My favorite game is scrabble", the following features are generated: my, my favorite, my favorite game, favorite, favorite game, favorite game is, etc. From the generated features an appropriate histogram is updated. There is one histogram for the positive set and one for the negative set.

Categorizing web pages is a well researched problem. An SVM classifier [248] is used because it is resistant to overfitting, can handle large dimensionality, and has been shown to be highly effective when compared to other methods for text classification. When using a linear kernel function, the final output is a weighted feature vector with a bias term. The returned weighted vector can be used to quickly classify a test document by simply taking the dot product of the weight vector with the document's feature vector and adding the bias term.

In [101], authors compare three different methods for classifying a web page: full-text, anchortext only, and extended anchortext only. Their experimental results show that anchortext alone is comparable for classification purposes with the full-text. Several papers agree that features on linking documents, in addition to the anchortext (but less than the whole page), can provide significant improvements. This work is consistent with these results, showing significant improvement in classification accuracy when using the extended anchortext instead of the document fulltext.
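The feature generation, histogram update, and linear scoring described above can be illustrated with a minimal Python sketch. It is not the implementation used in [101]: the function names, the whitespace tokenizer, the three-word cap on phrase length (inferred only from the example features listed in the text), and the weight values are all assumptions made for illustration.

```python
from collections import Counter


def ngram_features(text, max_n=3):
    """Return all word n-grams of length 1..max_n, in order of occurrence.

    Assumption: lowercase whitespace tokenization and max_n=3; the text
    only shows example features of up to three words.
    """
    words = text.lower().split()
    features = []
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            if start + n <= len(words):
                features.append(" ".join(words[start:start + n]))
    return features


def update_histogram(histogram, text, max_n=3):
    """Count each distinct feature once per document (set semantics)."""
    histogram.update(set(ngram_features(text, max_n)))


def linear_score(weights, bias, features):
    """Dot product of a weight vector with a binary feature vector, plus bias.

    With binary (present/absent) features the dot product reduces to the
    sum of the weights of the features that occur in the document.
    """
    return sum(weights.get(f, 0.0) for f in set(features)) + bias


if __name__ == "__main__":
    features = ngram_features("My favorite game is scrabble")
    print(features[:6])

    # One histogram per class, as in the text; only the positive one is shown.
    positive_hist = Counter()
    update_histogram(positive_hist, "My favorite game is scrabble")
    print(positive_hist["favorite game"])

    # Hypothetical weights and bias standing in for a trained linear SVM.
    weights = {"favorite game": 1.5, "scrabble": 0.5, "syllabus": -2.0}
    score = linear_score(weights, bias=-1.0, features=features)
    print("positive" if score > 0 else "negative")
```

Because a linear model stores only one weight per feature and a bias, classifying a new document costs one pass over its features, which is why the text describes the dot product as a quick way to classify.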
For comparison, this method is applied to full-texts for the categories of courses and faculty from the WebKB dataset.

The combination method is also highly effective for improving positive-class accuracy, but reduces negative class accuracy. The automatic combination also provided substantial improvement over the extended anchortext or the full-text alone for positive accuracy, but caused a slight reduction in negative class accuracy as compared to the extended anchortext case. More details are in [101].

Other works have utilized in-bound anchortext to help classify target web pages. For example, Blum and Mitchell [37] compared two classifiers for several computer science web
