
CSC 5800: Intelligent Systems
Homework 4
Due Date: November 11th, 2015
Total: 100 Points

Problem 1. Bayes' Theorem (10 Points; 2 + 3 + 5)

Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. It is also given that one-fifth of the college students are graduate students and the rest are undergraduates.

(a) What is the probability that a student who smokes is a graduate student?

(b) Is a randomly chosen smoker more likely to be a graduate or an undergraduate student?

(c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students do. If a student smokes and lives in a dorm, is he or she more likely to be a graduate or an undergraduate student? You can assume that living in a dorm and smoking are conditionally independent given the student's status.
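As a sanity check on the mechanics of part (a), Bayes' rule with the numbers above can be evaluated directly (a sketch of the calculation, not a substitute for showing your work):

```python
# Given: P(smoke | grad) = 0.23, P(smoke | undergrad) = 0.15,
# P(grad) = 1/5, P(undergrad) = 4/5.
p_smoke_grad, p_smoke_ugrad = 0.23, 0.15
p_grad, p_ugrad = 0.2, 0.8

# Law of total probability: P(smoke) = sum over classes of P(smoke | class) P(class).
p_smoke = p_smoke_grad * p_grad + p_smoke_ugrad * p_ugrad

# Bayes' rule: P(grad | smoke) = P(smoke | grad) P(grad) / P(smoke).
p_grad_given_smoke = p_smoke_grad * p_grad / p_smoke
print(round(p_grad_given_smoke, 4))
```

Comparing this posterior against 0.5 answers the "more likely" question in part (b); part (c) multiplies in the dorm likelihoods the same way.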

Problem 2. Bayesian Classification (15 Points; 2 + 4 + 3 + 4 + 2)

Consider the data set shown in the following table:

(a) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).

(b) Use these estimates of the conditional probabilities to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naive Bayes approach.

(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.

(d) Repeat part (b) using the conditional probabilities from part (c).

(e) Compare the two methods for estimating probabilities. Which method is better, and why?
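The m-estimate referenced in part (c) smooths a raw frequency estimate toward a prior p with equivalent sample size m. A minimal sketch (the counts below are hypothetical, since the data table is not reproduced here):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m),
    where n_c = samples in the class matching the attribute value,
    n = class size, p = prior estimate, m = equivalent sample size."""
    return (n_c + m * p) / (n + m)

# Hypothetical counts: 3 of 10 positive samples have the attribute value.
print(m_estimate(3, 10, p=0.5, m=4))   # (3 + 2) / 14
# Unlike the raw fraction, the m-estimate never yields a zero probability:
print(m_estimate(0, 10, p=0.5, m=4))   # (0 + 2) / 14
```

That non-zero guarantee is the key point to weigh when answering part (e).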


Problem 3. K-means Clustering (10 Points)

Consider the following six points (with (x, y) representing location in a 2-D space) and try to group them into three clusters. The distance function is Euclidean distance. Suppose we initially assign A1, B1, and C1 as the centers of the three clusters. Using the k-means algorithm, show:

(i) the cluster assignment of each data point after the first iteration;
(ii) the centroids after the first iteration.
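A single k-means iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its members) can be sketched as follows; the points and initial centers here are hypothetical stand-ins, since the problem's table is not reproduced:

```python
import math

def kmeans_iteration(points, centroids):
    """One k-means iteration: Euclidean assignment, then centroid update.
    Assumes every cluster ends up non-empty (no empty-cluster handling)."""
    assignments = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        assignments.append(dists.index(min(dists)))
    new_centroids = []
    for k in range(len(centroids)):
        members = [p for p, a in zip(points, assignments) if a == k]
        new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return assignments, new_centroids

# Hypothetical 2-D points and initial centers (stand-ins for A1, B1, C1).
pts = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4)]
assign, cents = kmeans_iteration(pts, [(2, 10), (5, 8), (1, 2)])
```

Repeating the call until the assignments stop changing gives the full algorithm.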

Problem 4. K-means Clustering (20 Points; 8 + 12)

PART I. Consider the one-dimensional dataset {1, 2, 3, 5, 9}. Perform the k-means algorithm with 2 clusters and initial centroids 0 and 9. Compute the following: (i) the final centroids, (ii) the cohesion, and (iii) the separation.
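Cohesion is commonly measured as the within-cluster SSE (squared distance of each point to its own centroid) and separation as the between-cluster sum of squares (squared distance of each centroid to the overall mean, weighted by cluster size). A sketch under those definitions, using one possible clustering of 1-D data (hypothetical, not the graded answer):

```python
def cohesion(clusters, centroids):
    """Within-cluster SSE: squared distances of points to their own centroid."""
    return sum((x - c) ** 2 for pts, c in zip(clusters, centroids) for x in pts)

def separation(clusters, centroids):
    """Between-cluster sum of squares, weighted by cluster size."""
    all_pts = [x for pts in clusters for x in pts]
    mean = sum(all_pts) / len(all_pts)
    return sum(len(pts) * (c - mean) ** 2 for pts, c in zip(clusters, centroids))

# Hypothetical example clusters over five 1-D points.
clusters = [[1, 2, 3], [5, 9]]
centroids = [2.0, 7.0]
print(cohesion(clusters, centroids), separation(clusters, centroids))
```

A useful check: cohesion plus separation equals the total sum of squares about the overall mean, so the two quantities trade off against each other.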

PART II. Consider the following set of one-dimensional data points: {0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9}.

(i) Suppose we apply k-means clustering to obtain three clusters, A, B, and C. If the initial centroids are located at {0, 0.25, 0.6}, respectively, show the cluster assignments and the locations of the centroids after the first three iterations. Compute the SSE of the k-means solution (after 3 iterations).

(ii) Apply bisecting k-means (with k = 3) to the data. First, apply k-means with k = 2 using initial centroids located at {0.1, 0.9}. Next, compute the SSE for each cluster (make sure you indicate the SSE values in your answer). Choose the cluster with the larger SSE value and split it further into 2 subclusters. You can choose the two data points with the smallest and largest values as your initial centroids. For example, if the cluster to be split contains the data points (0.20, 0.40, 0.60, and 0.80), then the centroids should be initialized to 0.20 and 0.80. Show the final clustering solution obtained by applying bisecting k-means.
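The bisecting step described above (run 2-means, pick the cluster with the larger SSE, and re-run 2-means on it seeded at its smallest and largest points) can be sketched as follows; this illustrates the procedure only, and you must still report the intermediate SSE values:

```python
def sse(cluster, centroid):
    """Sum of squared distances of 1-D points to a centroid."""
    return sum((x - centroid) ** 2 for x in cluster)

def two_means_1d(points, c0, c1, iters=10):
    """Plain k-means with k = 2 on 1-D data from given initial centroids.
    Ties go to the first cluster; no empty-cluster handling (a sketch)."""
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return a, b, c0, c1

data = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
a, b, c0, c1 = two_means_1d(data, 0.1, 0.9)   # first bisection
# Split the cluster with the larger SSE, seeded at its min and max points.
worse, keep = (a, b) if sse(a, c0) > sse(b, c1) else (b, a)
sub0, sub1, *_ = two_means_1d(worse, min(worse), max(worse))
```

The three final clusters are then `keep`, `sub0`, and `sub1`.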

Problem 5. WEKA – K-means Clustering (10 Points)

Load the iris.arff file into Weka. Click on the Cluster tab, choose the "SimpleKMeans" algorithm, and set "numClusters" to 3. Select "Classes to clusters evaluation", click "Ignore attributes", and select "class". Start the clustering.

(a) How many instances were clustered incorrectly? Provide the confusion matrix.

(b) How many instances are in cluster2? How many of these instances were clustered incorrectly, and which cluster should they belong to?

(c) Right-click on the result list and click "Visualize cluster assignments". Set the x-axis to Instance_number and the y-axis to sepallength, and change the color to class. Which type of iris flower has all of its instances clustered correctly?

Problem 6. Hierarchical Clustering (20 Points; 5 + 7 + 8)

(a) Perform hierarchical clustering (single linkage) on the one-dimensional dataset {0.1, 1, 1.7, 3.4, 3.9, 4.7}. (i) If we want to obtain two clusters, show the cluster membership of each data point. (ii) Draw the dendrogram.
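For intuition, single-linkage agglomeration can be sketched directly: start from singletons and repeatedly merge the two clusters whose closest members are nearest (on 1-D data this amounts to merging across the smallest gaps first, which also gives the merge order for the dendrogram):

```python
def single_linkage(points, k):
    """Agglomerate 1-D singletons until k clusters remain, always merging
    the pair of clusters with the smallest minimum inter-point distance."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(single_linkage([0.1, 1, 1.7, 3.4, 3.9, 4.7], k=2))
# -> [[0.1, 1, 1.7], [3.4, 3.9, 4.7]]
```

Recording the distance at each merge gives the heights for the dendrogram in part (ii).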

(b) Consider the following four data points. Use the cosine similarity measure and perform hierarchical clustering with the single-linkage algorithm. Give the proximity matrix and draw the corresponding dendrogram obtained after clustering.

A: (0 2 0 0); B: (2 0 1 2); C: (2 1 0 2); D: (2 2 1 0)
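The proximity matrix for part (b) starts from the pairwise cosine similarities; a sketch of that first step with the four vectors above:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vecs = {"A": (0, 2, 0, 0), "B": (2, 0, 1, 2), "C": (2, 1, 0, 2), "D": (2, 2, 1, 0)}
names = list(vecs)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(p, q, round(cosine(vecs[p], vecs[q]), 3))
```

Since these are similarities (not distances), single linkage merges the pair with the *largest* entry first and carries the maximum similarity forward between merged clusters.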

(c) Use the similarity matrix in the following table to perform single-linkage and complete-linkage hierarchical clustering. Show your results by drawing a dendrogram that clearly shows the order in which the points are merged. Also give the updated similarity matrix after each merge.

Problem 7. DBSCAN Clustering (15 Points)

Consider the data set shown in Figure 3. Suppose we apply the DBSCAN algorithm with Eps = 0.15 (in Euclidean distance) and MinPts = 3.

List all the core points in the diagram (you can use the labels of the data points in the diagram). Note: a point is considered a core point if there are more than MinPts points (including the point itself) within a neighborhood of radius Eps. List all the border points in the diagram. List all the noise points in the diagram.
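Since Figure 3 is not reproduced here, the classification can still be sketched on hypothetical points, following the problem's own definition (core: more than MinPts points, itself included, within Eps; border: not core but within Eps of a core point; noise: neither):

```python
import math

def dbscan_point_types(points, eps, min_pts):
    """Classify points as core / border / noise per the problem's definition."""
    neighbors = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) > min_pts}
    border = {p for p in points if p not in core
              and any(q in core for q in neighbors[p])}
    noise = set(points) - core - border
    return core, border, noise

# Hypothetical stand-ins for the labelled points in Figure 3:
# a dense group, one fringe point, and one isolated point.
pts = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.1), (0.12, 0.2), (0.34, 0.1), (0.8, 0.8)]
core, border, noise = dbscan_point_types(pts, eps=0.15, min_pts=3)
```

Applying the same three-way test to each labelled point in Figure 3 produces the required lists.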
