
Web Mining and Social Networking: Techniques and ... - tud.ttu.ee

3 Algorithms and Techniques

// Function to expand the cluster
Procedure ExpandCluster(o, N, C, Eps, MinPts):
    C = C ∪ {o};
    For each object o′ in N do
        If o′.visited = false then
            o′.visited = true;
            N′ = GetNeighbors(o′, Eps);
            If |N′| ≥ MinPts then N = N ∪ N′;
        end
        If o′ is not a member of any cluster then
            C = C ∪ {o′};
        end
    end

3.4 Semi-supervised Learning

In the previous two sections, we introduced learning on labeled data (supervised learning, or classification) and on unlabeled data (unsupervised learning, or clustering). In this section, we present the basic learning techniques for the case where both kinds of data are available. The intuition is that a large amount of unlabeled data is easy to obtain (e.g., pages crawled by Google), yet only a small portion of it can be labeled due to resource limitations. This line of research, called semi-supervised learning (or semi-supervised classification), aims to use the large amount of unlabeled data, together with the labeled data, to build better classifiers.

Many approaches have been proposed for semi-supervised classification; the representatives are self-training, co-training, generative models, and graph-based methods. We will introduce them in the next few sections. More strategies and algorithms can be found in [278, 59].

3.4.1 Self-Training

The idea of self-training (or self-teaching, bootstrapping) appeared long ago, and the first well-known paper applying self-training to a machine learning problem may be [265]. It is now a common approach for semi-supervised classification.

The basic idea of self-training is as follows: a classifier c is first trained on the labeled data L (which is small). Then we use the classifier c to classify the unlabeled data U. The confidently classified instances U′ (judged by a threshold), together with their predicted labels, are removed from U and inserted into L. The two datasets, labeled and unlabeled, are thus updated, and the procedure is repeated. The pseudocode of self-training is shown in Algorithm 3.10.

Algorithm 3.10: The self-training algorithm
Input: A labeled dataset L, an unlabeled dataset U, a confidence threshold t
Output: All data with labeled classes
Initialize a classifier c;
Repeat
    Train c on L;
    Classify each object in U with c;
    U′ = the objects in U classified with confidence ≥ t;
    L = L ∪ U′ (with their predicted labels); U = U \ U′;
Until U is empty or U′ is empty;
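The loop of Algorithm 3.10 can be sketched in a few lines of Python. The sketch below assumes scikit-learn is available; the base classifier (logistic regression), the confidence threshold t = 0.9, and the synthetic dataset are illustrative choices, not prescribed by the text.

```python
# Minimal self-training sketch: grow the labeled set L with confident
# predictions on the unlabeled set U, retraining after each round.
# Assumes scikit-learn; the base classifier and threshold are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, t=0.9, max_iter=20):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_iter):
        if len(X_u) == 0:                      # U is empty -> stop
            break
        clf.fit(X_l, y_l)                      # train c on L
        proba = clf.predict_proba(X_u)         # classify U with c
        conf = proba.max(axis=1)               # confidence of the best class
        mask = conf >= t                       # U': confidently classified
        if not mask.any():                     # U' is empty -> stop
            break
        # Move U' (with its predicted labels) from U into L
        new_labels = clf.classes_[proba[mask].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[mask]])
        y_l = np.concatenate([y_l, new_labels])
        X_u = X_u[~mask]
    return clf.fit(X_l, y_l)

# Tiny demonstration: 20 labeled points, 180 treated as unlabeled
X, y = make_classification(n_samples=200, random_state=0)
clf = self_train(X[:20], y[:20], X[20:])
```

Note that the confidence used here is the maximum class probability from predict_proba; any monotone confidence score (e.g., margin between the top two classes) would fit the same loop.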
