Introduction To Data Mining Instructor's Solution Manual

Recommendations

Info

22 Chapter 3 Exploring <strong>Data</strong> The best approach is to estimate what the actual distribution function of the data looks like using kernel density estimation. This branch of data analysis is relatively well-developed and is more appropriate if the widely available, but simplistic approach of a histogram is not sufficient. 8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11? (a) If the line representing the median of the data is in the middle of the box, then the data is symmetrically distributed, at least in terms of the 75% of the data between the first and third quartiles. For the remaining data, the length of the whiskers and outliers is also an indication, although, since these features do not involve as many points, they may be misleading. (b) Sepal width and length seem to be relatively symmetrically distributed, petal length seems to be rather skewed, and petal width is somewhat skewed. 9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12. For Setosa, sepal length > sepal width > petal length > petal width. For Versicolour and Virginiica, sepal length > sepal width and petal length > petal width, but although sepal length > petal length, petal length > sepal width. 10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income. A great deal of information can be obtained by looking at (1) the box plots for each attribute, and (2) the box plots for a particular attribute across various categories of a second attribute. For example, if we compare the box plots of age for different categories of ages, we would see that weight increases with age. 11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9. We would expect such a distribution if the three species of Iris can be ordered according to their size, and if petal length and width are both correlated to the size of the plant and each other. 12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.
There is a relatively flat area in the curves of the Empirical CDF’s and the percentile plots for both petal length and petal width. This indicates a set of flowers for which these attributes have a relatively uniform value. 13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 56 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data? The fact that the attribute values are ordered. 14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book. Any set of data for which all combinations of values are unlikely to occur would produce sparse data cubes. This would include sets of continuous attributes where the set of objects described by the attributes doesn’t occupy the entire data space, but only a fraction of it, as well as discrete attributes, where many combinations of values don’t occur. A dense data cube would tend to arise, when either almost all combinations of the categories of the underlying attributes occur, or the level of aggregation is high enough so that all combinations are likely to have values. For example, consider a data set that contains the type of traffic accident, as well as its location and date. The original data cube would be very sparse, but if it is aggregated to have categories consisting single or multiple car accident, the state of the accident, and the month in which it occurred, then we would obtain a dense data cube. 15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest? A summary statistics that would be of interest would be the frequencies with which values or combinations of values, target and otherwise, occur. From this we could derive conditional relationships among various values. In turn, these relationships could be displayed using a graph similar to that used to display Bayesian networks. 23
Page 1: Introduction to Data Mining Instruc
Page 5 and 6: Introduction 1 1. Discuss whether o
Page 7: 3. For each of the following data s
Page 10 and 11: 6 Chapter 2 Data (i) Ability to pas
Page 12 and 13: 8 Chapter 2 Data (a) How would you
Page 14 and 15: 10 Chapter 2 Data 13. Consider the
Page 16 and 17: 12 Chapter 2 Data x = 0101010001 y
Page 18 and 19: 14 Chapter 2 Data Same as previous
Page 20 and 21: 16 Chapter 2 Data 23. Given a simil
Page 22 and 23: 18 Chapter 2 Data However, note tha
Page 24 and 25: 20 Chapter 3 Exploring Data the vie
Page 28 and 29: 24 Chapter 3 Exploring Data 16. Con
Page 30 and 31: 26 Chapter 4 Classification The pre
Page 32 and 33: 28 Chapter 4 Classification (b) Wha
Page 34 and 35: 30 Chapter 4 Classification where P
Page 36 and 37: 32 Chapter 4 Classification The inf
Page 38 and 39: 34 Chapter 4 Classification X C1 C2
Page 40 and 41: 36 Chapter 4 Classification Answer:
Page 42 and 43: 38 Chapter 4 Classification 0 A B C
Page 44 and 45: 40 Chapter 4 Classification • Eac
Page 46 and 47: 42 Chapter 4 Classification Table 4
Page 49 and 50: Classification: Alternative Techniq
Page 51 and 52: Answer: For this problem, P = 500,
Page 53 and 54: 49 For R2, the expected frequency f
Page 55 and 56: Answer: If the examples for R1 are
Page 57 and 58: 53 (b) Use the estimate of conditio
Page 59 and 60: Table 5.2. Data set for Exercise 8.
Page 61 and 62: 57 (b) The prior probabilities are
Page 63 and 64: Answer: 59 P (B = good, F = empty,
Page 65 and 66: 15. For each of the Boolean functio
Page 67 and 68: When t =0.5, the confusion matrix f
Page 69 and 70: Y =0 Y =1 Y =2 + 0 30 0 − 200 200
Page 71 and 72: The total cost is F = p(a + d)+q(b
Page 73 and 74: Notice that the dual Lagrangian dep
Page 75 and 76: Association Analysis: Basic Concept
Page 77 and 78:
(d) Use the results in part (c) to
Page 79 and 80:
ζ({A, B, C}) = min � c(A −→
Page 81 and 82:
Since s(A, B, C) ≤ s(A, B) and mi
Page 83 and 84:
{1, 3, 4, 5},{1, 3, 4, 6},{2, 3, 4,
Page 85 and 86:
L1 1,4,7 2,5,8 3,6,9 L5 L6 1,4,7 L7
Page 87 and 88:
ab ac null a b c d e ad abc abd abe
Page 89 and 90:
Table 6.4. Example of market basket
Page 91 and 92:
items. We will apply the Apriori al
Page 93 and 94:
89 where β =1− P (A) +2P (A, B).
Page 95 and 96:
91 (d) How does M behave when P (B)
Page 97 and 98:
C = 0 C = 1 Table 6.5. A Contingenc
Page 99 and 100:
Association Analysis: Advanced Conc
Page 101 and 102:
97 The number of frequent itemsets
Page 103 and 104:
Pressure Answer: The graph of Tempe
Page 105 and 106:
101 for A.) For each rule that corr
Page 107 and 108:
When bin − width =4: Where Table
Page 109 and 110:
105 approximate each interval by it
Page 111 and 112:
107 - Portuguese - Russian - Spanis
Page 113 and 114:
109 (b) Consider the approach where
Page 115 and 116:
111 (c) List all the 4-subsequences
Page 117 and 118:
• s =< {1, 2, 3, 4, 5, 6}{1, 2, 3
Page 119 and 120:
115 When there is no timing constra
Page 121 and 122:
b a a a a a b a a b a a b a a a (a)
Page 123 and 124:
a b 1 1 1 1 G 1 G 3 c e 1 d a 1 e 1
Page 125 and 126:
Table 7.17. Example of numeric data
Page 127:
123 (d) How should the support coun
Page 130 and 131:
126 Chapter 8 Cluster Analysis: Bas
Page 132 and 133:
Page 134 and 135:
Page 136 and 137:
Page 138 and 139:
Page 140 and 141:
Page 142 and 143:
Page 144 and 145:
Page 146 and 147:
Page 148 and 149:
Page 151 and 152:
Cluster Analysis: Additional Issues
Page 153 and 154:
149 The main advantage of this appr
Page 155 and 156:
y 6 4 2 0 -2 -4 -6 -8 -10 -8 -6 -4
Page 157 and 158:
Table 9.1. Two nearest neighbors of
Page 159:
155 Proof: We know d(b, c) ≤ d(b,
Page 162 and 163:
158 Chapter 10 Anomaly Detection ni
Page 164 and 165:
160 Chapter 10 Anomaly Detection tr
Page 166 and 167:
162 Chapter 10 Anomaly Detection Th
Page 168 and 169:
164 Chapter 10 Anomaly Detection 15
show all

Introduction To Data Mining Instructor's Solution Manual

Create successful ePaper yourself

Delete template?

Save as template?