Predicting Cardiovascular Risks using Pattern Recognition and Data ...

in scientific and industrial applications (Berkhin, 2002). An example of K-means, KMIX, is used with the thesis data. The results (see Table 4.5 in Chapter 4) show an improvement in the ability to deal with the mixture of numerical, categorical, and Boolean attributes in the data set, by using both Euclidean and Hamming distances to the appropriate clustering centre. The centre vector types and their measures can be listed according to attribute type as follows:

Numerical (continuous) attributes: The centre is the average of all attribute values. The measurement method used here is Euclidean distance.

Categorical (discrete) attributes: The centre is the mode (the value with the maximum frequency) in the attribute. The measurement method is Hamming distance (see Equation (4.17) in Chapter 4).

Boolean attributes: These can be viewed as either numerical or categorical attributes. In the thesis they are, in most cases, treated as categorical attributes.

The KMIX algorithm is compared with the published K-means results through the use of standard data sets from the UCI repository (Merz & Murphy, 1996). The better results for KMIX (see Table 4.5 in Chapter 4) give confidence in its use for the thesis data.

9.2.5. Can The Attribute Set Be Decreased By Defining The Significant Attributes For The Data Domain?

The use of filtering and ranking methods based on mutual information can reduce the number of attributes for the data domain. The mutual information between the attributes and the output classes can be calculated as in Equation (8.15) in Chapter 8. This is based on Bayes' theorem (Bayes, 1763), entropy (Shannon, 1948), and the Kullback-Leibler divergence (Cover and Thomas, 1991), and is given by:

MI(C, x_j) = \sum_{i=1}^{c} \sum_{k=1}^{s_j} p_{ijk} \log \frac{p_{ijk}}{q_i \, r_{jk}},

where p_{ijk} = sum_{ijk} / sum, with sum_{ijk} the number of patterns in class C_i whose j-th attribute is in the k-th state; q_i = sum_i / sum, where sum_i is the number of patterns belonging to class C_i; and r_{jk} = sum_{jk} / sum, where sum_{jk} is the number of patterns whose j-th attribute is in the k-th state.
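The count-ratio estimates above can be computed directly from the data. A minimal Python sketch (the function name and data layout are illustrative, not from the thesis):

```python
import math
from collections import Counter

def mutual_information(classes, attribute):
    """MI between a list of class labels and one discrete attribute,
    using the count ratios p_ijk = sum_ijk/sum, q_i = sum_i/sum,
    r_jk = sum_jk/sum from the equation above."""
    n = len(classes)
    joint = Counter(zip(classes, attribute))   # sum_ijk
    class_counts = Counter(classes)            # sum_i
    state_counts = Counter(attribute)          # sum_jk
    mi = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / n                           # p_ijk
        q = class_counts[c] / n                # q_i
        r = state_counts[k] / n                # r_jk
        mi += p * math.log(p / (q * r))
    return mi

# A perfectly informative attribute scores log 2 (in nats);
# an independent one scores 0.
print(mutual_information([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # ~0.693
print(mutual_information([0, 1, 0, 1], ['a', 'a', 'b', 'b']))  # 0.0
```

Attributes would then be ranked by this score and the lowest-scoring ones filtered out.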
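The mixed Euclidean/Hamming measure and the mixed centre described for KMIX earlier in the section can be sketched as follows. This is an assumption-laden illustration: the function names, the data layout, and the simple unweighted sum of the two distance parts are mine; the thesis's Equation (4.17) defines the actual measure.

```python
import math
from collections import Counter

def mixed_distance(x, centre, numeric_idx):
    """KMIX-style distance: Euclidean over the numerical attributes
    plus Hamming (count of mismatches) over the categorical ones.
    Boolean attributes are treated as categorical, as in the thesis."""
    d_num = sum((x[i] - centre[i]) ** 2 for i in numeric_idx)
    d_cat = sum(1 for i in range(len(x))
                if i not in numeric_idx and x[i] != centre[i])
    return math.sqrt(d_num) + d_cat

def mixed_centre(cluster, numeric_idx):
    """Centre vector: mean for numerical attributes, mode for the rest."""
    centre = []
    for i in range(len(cluster[0])):
        values = [row[i] for row in cluster]
        if i in numeric_idx:
            centre.append(sum(values) / len(values))        # average
        else:
            centre.append(Counter(values).most_common(1)[0][0])  # mode
    return centre

cluster = [[1.0, 'a'], [3.0, 'a'], [2.0, 'b']]
print(mixed_centre(cluster, {0}))                      # [2.0, 'a']
print(mixed_distance([2.0, 'b'], [2.0, 'a'], {0}))     # 1.0
```

Embedded in the usual K-means loop (assign each pattern to the nearest centre under `mixed_distance`, then recompute centres with `mixed_centre`), this handles mixed attribute types without encoding categories as numbers.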
