
distribution over the data is significantly different from the uniform distribution. Therefore, at each node, skewing repeatedly attempts to produce, from the existing set of examples, skewed versions that represent different distributions. Several different weight vectors are assigned to the examples for this purpose. Then, the information gain of each attribute is calculated on the basis of each weight vector. The attribute whose gain exceeds a predetermined threshold the greatest number of times is chosen. In order to generate the weight vectors, a favored value v_i is randomly drawn for each candidate attribute a_i. Then, the weight of each example in which a_i takes the value v_i is increased.
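The voting procedure just described can be summarized compactly. The following is a minimal sketch for binary attributes, assuming labels and attribute values are encoded as 0/1 NumPy arrays; the function names, the weight boost factor, the number of iterations, and the gain threshold are illustrative choices, not the parameters of the original skewing algorithm.

```python
import numpy as np

def weighted_entropy(y, w):
    """Entropy of the class labels y under example weights w."""
    p = np.array([w[y == c].sum() for c in np.unique(y)]) / w.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def weighted_info_gain(col, y, w):
    """Information gain of a binary attribute column under example weights w."""
    gain = weighted_entropy(y, w)
    for v in np.unique(col):
        mask = col == v
        gain -= (w[mask].sum() / w.sum()) * weighted_entropy(y[mask], w[mask])
    return gain

def skewing_choice(X, y, n_iterations=30, boost=2.0, gain_threshold=0.05):
    """Illustrative sketch of the skewing heuristic: repeatedly reweight the
    examples toward randomly favored attribute values and count how often
    each attribute's weighted gain exceeds the threshold."""
    n_examples, n_attributes = X.shape
    votes = np.zeros(n_attributes, dtype=int)
    rng = np.random.default_rng()
    for _ in range(n_iterations):
        # Randomly draw a favored value v_i for every candidate attribute a_i ...
        favored = rng.integers(0, 2, size=n_attributes)
        # ... and increase the weight of each example in which a_i takes v_i.
        weights = np.ones(n_examples)
        for i in range(n_attributes):
            weights[X[:, i] == favored[i]] *= boost
        # Vote for every attribute whose gain under this skewed distribution
        # exceeds the predetermined threshold.
        for i in range(n_attributes):
            if weighted_info_gain(X[:, i], y, weights) > gain_threshold:
                votes[i] += 1
    # The attribute that exceeded the threshold the greatest number of times wins.
    return int(np.argmax(votes))
```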

This process can be viewed as a stochastic clustering of the examples according to the values they take for a subset of the attributes: examples that most agree with the favored value for a large number of attributes are assigned the highest weight. Therefore, the calculation of the information gain will be most affected by sets of examples that share the same values for different subsets of the attributes.

In regular k-step lookahead, for each tuple of k attributes we divide the set of examples into 2^k subsets, each of which is associated with a different k-tuple of binary values, i.e., a different sub-path of length k in the lookahead tree. Then, we calculate the information gain for each subset and compute the average gain weighted by subset sizes.
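As a point of comparison, the size-weighted average can be read in code as follows, under the assumption that the gain computed inside each of the 2^k subsets is that of a single next candidate attribute; the names lookahead_gain, k_tuple, and candidate are illustrative, not taken from the thesis.

```python
from itertools import product
import numpy as np

def entropy(y):
    """Entropy of binary class labels y (encoded as 0/1)."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(col, y):
    """Information gain of a binary attribute column for labels y."""
    gain = entropy(y)
    for v in (0, 1):
        mask = col == v
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def lookahead_gain(X, y, k_tuple, candidate):
    """Size-weighted average gain of `candidate` over the 2^k subsets
    induced by the attributes in `k_tuple` (one subset per sub-path of
    length k in the lookahead tree)."""
    total = 0.0
    for values in product((0, 1), repeat=len(k_tuple)):
        # Subset associated with this k-tuple of binary values.
        mask = np.all(X[:, list(k_tuple)] == values, axis=1)
        if mask.any():
            total += mask.mean() * info_gain(X[mask, candidate], y[mask])
    return total
```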

In skewing, each iteration assigns a high weight to a few subsets of the examples that have common values for some of the attributes. These subsets have the greatest influence on the gain calculation at that skewing iteration. Hence, skewing iterations can be seen as sampling the space of lookahead sub-paths and considering a few of them at a time.

The great advantage of skewing over our proposed framework is efficiency. While LSID3 samples full decision trees, skewing samples only single sub-paths. When there are many examples but relatively few attributes, skewing improves greatly over the greedy learners at a low cost. The effectiveness of skewing, however, noticeably degrades when the concept hides a mutual dependency between a large number of attributes. When the number of attributes increases, it becomes harder to create large clusters with common values for some of the relevant ones, and hence the resulting lookahead is shallower.

Consider, for example, the n-XOR problem with n additional irrelevant attributes. For n = 2, a reweighting that assigns a high weight to examples that agree on just 1 of the 2 relevant attributes results in a high gain for the second attribute, because splitting on it reduces the entropy of the cluster to 0. To obtain a positive gain for a relevant attribute when n = 10, however, we must cluster together examples that agree on 9 of the 10 relevant attributes. Obviously, the probability of forming a good cluster in the first case is high, while in the second case it is almost 0. The experiments reported in Section 3.7.6 provide empirical support for this argument.
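A back-of-the-envelope calculation, assuming each favored value is drawn uniformly at random (an illustrative assumption, not a computation from the thesis), shows how quickly the chance of agreeing with the favored values on the other relevant attributes collapses:

```python
# A cluster that helps one relevant attribute of n-XOR must effectively fix
# the values of the other n-1 relevant attributes; under uniformly random
# favored values, the chance that a given example agrees on all of them
# drops as 2^-(n-1).
for n in (2, 10):
    print(f"n = {n:2d}: agreement probability = {0.5 ** (n - 1):.4f}")
# n =  2: agreement probability = 0.5000
# n = 10: agreement probability = 0.0020
```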
