CSC 5800 Intelligent Systems Homework 4
Due Date: November 11th, 2015
Total: 100 Points
Problem 1. Bayes' Theorem (10 Points; 2 + 3 + 5)
Suppose the fraction of undergraduate students who smoke is 15% and the fraction of
graduate students who smoke is 23%. It is also given that one-fifth of the college students
are graduate students and the rest are undergraduates.
(a) What is the probability that a student who smokes is a graduate student?
(b) Is a randomly chosen smoker more likely to be a graduate or an undergraduate student?
(c) Suppose 30% of the graduate students live in a dorm but only 10% of the
undergraduate students live in a dorm. If a student smokes and lives in a dorm, is he or
she more likely to be a graduate or an undergraduate student? You can assume that living
in a dorm is independent of smoking.
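Part (a) is a direct application of Bayes' theorem together with the law of total probability. A quick numerical sketch in Python (the variable names are mine, not part of the assignment):

```python
# Given: P(smoke|grad) = 0.23, P(smoke|undergrad) = 0.15,
#        P(grad) = 1/5, P(undergrad) = 4/5
p_smoke_given_grad = 0.23
p_smoke_given_ugrad = 0.15
p_grad, p_ugrad = 0.2, 0.8

# Law of total probability: P(smoke)
p_smoke = p_smoke_given_grad * p_grad + p_smoke_given_ugrad * p_ugrad

# Bayes' theorem: P(grad|smoke) = P(smoke|grad) * P(grad) / P(smoke)
p_grad_given_smoke = p_smoke_given_grad * p_grad / p_smoke
print(round(p_grad_given_smoke, 4))
```

Comparing this posterior with its complement answers part (b); part (c) multiplies in the dorm likelihoods, using the stated independence assumption.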
Problem 2. Bayesian Classification (15 Points; 2 + 4 + 3 + 4 + 2)
Consider the data set shown in the following Table:
(a) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|−), P(B|−),
and P(C|−).
(b) Use these estimates of the conditional probabilities to predict the class label for the
test sample (A = 0, B = 1, C = 0) using the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2
and m = 4.
(d) Repeat part (b) using the conditional probabilities from part (c).
(e) Compare the two methods for estimating probabilities. Which method is better,
and why?
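The m-estimate in part (c) smooths the raw fraction n_c/n toward the prior p, which matters when a count is zero. A minimal sketch of the formula (the counts in the example call are hypothetical, not taken from the Table):

```python
def m_estimate(n_c, n, m=4, p=0.5):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m),
    where n_c is the count of records with the attribute value in the class
    and n is the total number of records in that class."""
    return (n_c + m * p) / (n + m)

# Hypothetical example: 3 of 5 positive records have the attribute value
print(m_estimate(3, 5))  # (3 + 4*0.5) / (5 + 4) = 5/9
```

With n_c = 0 the plain estimate 0/n would zero out the whole naive Bayes product, while the m-estimate gives m*p/(n + m) > 0; that contrast is the heart of part (e).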
Problem 3. K-means Clustering (10 Points)
Consider the following six points (with (x, y) representing location in a 2-D space) and let
us try to group them into three clusters.
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1
as the centers of the three clusters. Using the k-means algorithm, show:
(i) the cluster assignment of each data point after the first iteration;
(ii) the centroids after the first iteration.
Problem 4. K-means Clustering (20 Points; 8 + 12)
PART I. Consider the one-dimensional dataset {1, 2, 3, 5, 9}. Perform the k-means
algorithm with 2 clusters and initial centroids 0 and 9. Compute the following: (i) the
final centroids; (ii) the cohesion; (iii) the separation.
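Part I can be checked with a minimal 1-D k-means sketch. Cohesion is taken here as the within-cluster SSE and separation as the between-cluster sum of squares about the overall mean; the helper names are mine, and other textbook definitions of these two quantities exist:

```python
def kmeans_1d(data, centroids, iters=10):
    """Lloyd's algorithm on 1-D points; returns final centroids and clusters."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            # assign each point to its nearest centroid
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[i].append(x)
        # recompute each centroid as its cluster mean (keep old one if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1, 2, 3, 5, 9]
centroids, clusters = kmeans_1d(data, [0, 9])
cohesion = sum((x - c) ** 2 for c, cl in zip(centroids, clusters) for x in cl)
overall = sum(data) / len(data)
separation = sum(len(cl) * (c - overall) ** 2
                 for c, cl in zip(centroids, clusters))
print(centroids, cohesion, separation)
```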
PART II. Consider the following set of one-dimensional data points: {0.1, 0.2, 0.4, 0.5,
0.6, 0.8, 0.9}. (i) Suppose we apply k-means clustering to obtain three clusters, A, B, and
C. If the initial centroids are located at {0, 0.25, 0.6}, respectively, show the cluster
assignments and the locations of the centroids after the first three iterations. Compute the
SSE of the k-means solution (after 3 iterations). (ii) Apply bisecting k-means (with k = 3)
to the data. First, apply k-means on the data with k = 2 using initial centroids located at
{0.1, 0.9}. Next, compute the SSE for each cluster (make sure you indicate the SSE values
in your answer). Choose the cluster with the larger SSE value and split it further into 2
sub-clusters. You can choose the two data points with the smallest and largest values as
your initial centroids. For example, if the cluster to be split contains the data points (0.20,
0.40, 0.60, and 0.80), then the centroids should be initialized to 0.20 and 0.80. Show the
clustering solution obtained by applying bisecting k-means.
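The bisecting procedure in part (ii) can be sketched by reusing plain k-means on the higher-SSE cluster. One caveat: the point 0.5 is mathematically equidistant from the initial centroids 0.1 and 0.9, so a tie-breaking rule is needed; the sketch below assigns ties to the first (lower) centroid, which is my assumption, not something the assignment specifies:

```python
def kmeans_1d(data, centroids, iters=20):
    """Lloyd's algorithm on 1-D points; ties go to the first centroid."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[i].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def sse(cluster):
    """Within-cluster sum of squared errors about the cluster mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

data = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
# Step 1: plain k-means with k = 2 and initial centroids {0.1, 0.9}
_, (c1, c2) = kmeans_1d(data, [0.1, 0.9])
# Step 2: split the higher-SSE cluster, seeding with its min and max points
big, keep = (c1, c2) if sse(c1) > sse(c2) else (c2, c1)
_, (s1, s2) = kmeans_1d(big, [min(big), max(big)])
print(sorted([keep, s1, s2], key=min))
```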
Problem 5. WEKA – K-means Clustering (10 Points)
Load the iris.arff file into Weka. Click on the Cluster tab, choose the "SimpleKMeans"
algorithm, and set "numClusters" to 3. Select "Classes to clusters evaluation", click on
"Ignore attributes", and select "class". Start the clustering.
(a) How many instances were clustered incorrectly? Provide the confusion matrix.
(b) How many instances are in cluster2? How many of these instances were
incorrectly clustered, and to which cluster should they belong?
(c) Right-click on the result list and click on "Visualize cluster assignments". Set the
x-axis to instance_number and the y-axis to sepallength. Change the color to class.
Which type of iris flower has all of its instances clustered correctly?
Problem 6. Hierarchical Clustering (20 Points; 5 + 7 + 8)
(a) Perform hierarchical clustering (single linkage) on the following one-dimensional
dataset: {0.1, 1, 1.7, 3.4, 3.9, 4.7}. (i) If we want to obtain two clusters, show the cluster
membership for each data point. (ii) Draw the dendrogram.
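Single linkage merges, at each step, the two clusters whose closest members are nearest to each other. A naive agglomerative sketch for part (a) (quadratic search per merge, fine at this size; the function name is mine):

```python
def single_linkage(points, k):
    """Agglomerative clustering with single (minimum) linkage on 1-D data,
    stopping when k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b) for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

result = single_linkage([0.1, 1, 1.7, 3.4, 3.9, 4.7], 2)
print(result)
```

Recording the distance at which each merge happens gives exactly the heights needed to draw the dendrogram in part (ii).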
(b) Consider the following four data points. Using the cosine similarity measure,
perform hierarchical clustering with the single linkage algorithm. Give the
proximity matrix and draw the corresponding dendrogram obtained after clustering.
A: (0 2 0 0); B: (2 0 1 2); C: (2 1 0 2); D: (2 2 1 0)
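The proximity matrix for part (b) can be checked numerically. A small sketch computing the pairwise cosine similarities (remember that with a similarity measure, the pair with the largest value is merged first):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

points = {"A": (0, 2, 0, 0), "B": (2, 0, 1, 2),
          "C": (2, 1, 0, 2), "D": (2, 2, 1, 0)}
for x in points:
    for y in points:
        if x < y:
            print(x, y, round(cosine(points[x], points[y]), 3))
```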
(c) Use the similarity matrix in the following Table to perform single and complete
linkage hierarchical clustering. Show your results by drawing a dendrogram that
clearly shows the order in which the points are merged. Also, give the updated similarity
matrix after each merge.
Problem 7. DBSCAN Clustering (15 Points)
Consider the data set shown in Figure 3. Suppose we apply the DBSCAN algorithm with
Eps = 0.15 (in Euclidean distance) and MinPts = 3.
List all the core points in the diagram (you can use the labels of the data points in the
diagram). Note: a point is considered a core point if there are more than MinPts points
(including the point itself) within a neighborhood of radius Eps. List all the
border points in the diagram. List all the noise points in the diagram.
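The core/border/noise classification step can be sketched as follows. Since Figure 3 is not reproduced here, the coordinates below are hypothetical, chosen only to exercise each category; the sketch also uses the common "at least MinPts points, the point itself included" reading of the core-point condition, so adjust the comparison if you read the note above more strictly:

```python
from math import dist

def dbscan_labels(points, eps=0.15, min_pts=3):
    """Label each point core / border / noise (classification step only,
    not the full cluster-expansion phase of DBSCAN)."""
    # core: at least min_pts points (itself included) within eps
    core = {p for p, xy in points.items()
            if sum(dist(xy, q) <= eps for q in points.values()) >= min_pts}
    labels = {}
    for p, xy in points.items():
        if p in core:
            labels[p] = "core"
        elif any(dist(xy, points[c]) <= eps for c in core):
            labels[p] = "border"  # not core, but inside a core point's Eps
        else:
            labels[p] = "noise"
    return labels

# Hypothetical points standing in for Figure 3
pts = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.1),
       "d": (0.22, 0.0), "e": (1.0, 1.0)}
labels = dbscan_labels(pts)
print(labels)
```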