Predicting Cardiovascular Risks using Pattern Recognition and Data ...

More documents

Recommendations

Info

just the Euclidian measurements of the K-means. These measures will be adequate for the mixture ofattribute types in the thesis data.4.3.2.1. Introduction and NotationsClustering analysis is a machine learning area of particular interest to pattern recognition data mining.The resulting data partition improves the data understanding, and reveals its internal structure.Clustering has been used in many application domains including biology, medicine, and so on, asindicated in Chapter 3. Further clustering applications can be seen in Dunham (2002); Berkhin (2002);Mirkin (2005).One well-known partition clustering algorithm is the K-means algorithm (Forgey, 1965; Jancey, 1966;MacQueen, 1967; Hartigan, 1975; Hartigan and Wong, 1979) which, according to Berkhin (2002), isthe most popular clustering tool used in scientific and industrial applications. The main duty of K-means is to partition n patterns in input space into k clusters such that similar patterns belong to thesame clusters; and dissimilar patterns will belong to alternative clusters.Assume that X is a pattern (observation, case, or patient record in data set). X typically consists of mcomponents: X= (x 1, x 2 ,…,x m ) = (x j ) j=1,..m . Note that each component in multidimensional space is anattribute (continuous or discrete) in the data domain. Therefore, we have a n m pattern matrix, wheren patterns {X i } i=1..n ; X i = (x i,1 , x i,2 ,.., x i,m ), and m attributes.Similarity MeasurementsA similarity measurement is the strength of relationship between two patterns in the samemultidimensional space. It can be represented as sim ij = sim(x i ,x j ), i,j= 1,2,.. n. According to Gower(1985), a similarity is regarded as a symmetric relationship. This means sim(x i ,x j )=sim(x j ,x i ).Contrastingly, dissimilarity measures of patterns have been introduced as the complement of similaritymeasures. A list of dissimilarity measures can be seen in Gower (1985). The dissimilarity measuresused in this thesis are as follows:Continuous attributes: the most common measure used is the Euclidean distance between twopatterns. It is given by:2dissim(x , x ) [ D(x , x )] ijijm1( x x2) , i,j 1,2..n(4.14)ik jkk 1,51where D is Euclidean distance, m 1
Discrete (categorical) attributes: the similarity measure between two patterns depends on the numberof similar values in the categorical attribute (Kaufman & Rousseeuw, 1990). This means thedissimilarity is a number of different values in this categorical attribute. It is given by:dissim(x , x ) d(x , x ) ijijm2k1(xik, xjk)i,j 1,2..n(4.15);0if xik x jkwhere ( xik , x jk ) , k 1,2,..m2;i,j 1,2..,n , and m1 if xikx2 is number of categoricaljkattributes.Boolean attributes: The dissimilarity measures are calculated as in the categorical or continuousattributes according to the interpretation of the attribute.Centre VectorsAssume that the data attribute set includes continuous and discrete attributes. Note that Booleanattributes are treated as continuous or discrete as indicated above. Therefore, there are two types ofcentre vectors. Assume that m attributes contains the p first continuous attributes; and m-p remainingdiscrete attributes. This means each pattern X in the input space can be seen as:X=(x i1 , x i2, …x ip , x ip+1, x ip+2, … x im ) )If Q is a centre vector for the sub data set C, Q can be represented as:Q= (q j1, q j2, … q jp , q jp+1, q jp+2, …, q jm )The task now is to find p continuous attribute values, and m-p discrete attribute values for centre vectorQ. According to Han (1981), these centre attribute values can be calculated as follows:Continuous attribute: The centre values {q jk } k=1,..p ,={mean k }, where mean k is the average of k thattribute.Discrete attribute: The centre values {q jk } k=p+1,.., m ={mode k }, where mode k is the “mode” of k thattribute.Definition 1: A vector Q is a “mode vector” of a data set C = (X 1, X 2, … X c ) , c
Page 1 and 2:
THE UNIVERSITY OF HULLPredicting Ca
Page 3 and 4:
3.4.2. Nonlinear Models ...........
Page 5 and 6:
8.3. Mutual Information ...........
Page 7 and 8:
List of FiguresFig. 3.1 Typical pat
Page 9 and 10:
List of TablesTable 2.1The 11 facto
Page 11 and 12:
Table 6.14 Alternative number of cr
Page 13 and 14:
Table C12 Neural network results of
Page 15 and 16: AbbreviationsNN(s): Neural Network(
Page 17 and 18: Chapter 1 IntroductionThis thesis p
Page 19 and 20: on. According to Bishop (1995), the
Page 21 and 22: techniques is given. The standard c
Page 23 and 24: A knowledge base that stores the in
Page 25 and 26: This score is then compared to the
Page 27 and 28: Cardiac signsRespiratory signsSysto
Page 29 and 30: Given the ratings in Table 2.2, and
Page 31 and 32: R11e3 z2Reg.No PATIENT_STATUS Physi
Page 33 and 34: designed to their specific purposes
Page 35 and 36: Chapter 3 Pattern Recognition3.1. I
Page 37 and 38: algorithms depending on the request
Page 39 and 40: a new patient. For example, t 5 can
Page 41 and 42: And the strategy for pattern learni
Page 43 and 44: Hence, whenever there is a specific
Page 45 and 46: 3.4.2. Nonlinear ModelsDefinition:
Page 47 and 48: vector machine are nonlinear models
Page 49 and 50: From the confusion matrix in Table
Page 51 and 52: the accuracy of 0.90 shows the trad
Page 53 and 54: The comparison between neural netwo
Page 55 and 56: Chapter 4 Supervised and Unsupervis
Page 57 and 58: Therefore, the visualization of pat
Page 59 and 60: A popular form of Gaussian basic fu
Page 61 and 62: mj 1w ( x) b 0jj(4.7)where x is a
Page 63 and 64: as an instance of the popular K-mea
Page 65: o Update the weight as follows:w i
Page 69 and 70: where d N (X i ,Q j ) is calculated
Page 71 and 72: Zoo SmallThis data contains 101 cas
Page 73 and 74: 1.20Discussions1.000.800.600.400.20
Page 75 and 76: Chapter 5 Data Mining Methodology a
Page 77 and 78: Raw dataDataselectingTarget DataPre
Page 79 and 80: missing data”, the more detailed
Page 81 and 82: Step 5 (Comparison/ Evaluation): Th
Page 83 and 84: Missing values: The data includes 1
Page 85 and 86: Scoring Risk Models: These are buil
Page 87 and 88: The missing values for the “Heart
Page 89 and 90: Data Filtering StageIn this step, t
Page 91 and 92: Supervised ClassifiersThis section
Page 93 and 94: 5.4. SummaryData mining is a partic
Page 95 and 96: the significant level of 98.6% (P v
Page 97 and 98: Attribute nameAttributetypeMissingv
Page 99 and 100: Attribute nameAttributetypeMissingv
Page 101 and 102: 6.4. Scoring Risk ModelsThe data fo
Page 103 and 104: are nearly the same (12). The maxim
Page 105 and 106: inconsistency between the linear an
Page 107 and 108: epochs can help to reduce the over-
Page 109 and 110: MSE (0.09). However, the model MLP_
Page 111 and 112: suitable for the CM3aD data set wit
Page 113 and 114: MethodA self organizing map tool is
Page 115 and 116: The SOM algorithm is applied to the
Page 117 and 118:
clustering models CM3aDC and CM3bDC
Page 119 and 120:
clustering results, resulting from
Page 121 and 122:
and CM3bD (KMIX and SOM) suggest th
Page 123 and 124:
equivalent rates for “High risk
Page 125 and 126:
25 input attributes to 16 input att
Page 127 and 128:
7.2.4. Scoring Risk ModelsThis sect
Page 129 and 130:
increasing number of patterns impro
Page 131 and 132:
other words, there are negligible o
Page 133 and 134:
ACCRand = PRand(true positive true
Page 135 and 136:
Therefore, the poor clustering perf
Page 137 and 138:
difficulty of measuring influential
Page 139 and 140:
“High risk” predictions in "cli
Page 141 and 142:
chosen by the user. The methods for
Page 143 and 144:
average information content of the
Page 145 and 146:
MI(X , Y) H(X ) H(X | Y) pi,j( x,
Page 147 and 148:
number of patterns in class C i ; a
Page 149 and 150:
The results in Figure 8.3 show that
Page 151 and 152:
and the prediction risks reflects i
Page 153 and 154:
Adding attribute weights to the clu
Page 155 and 156:
system. Moreover, in POSSUM and PPO
Page 157 and 158:
in scientific and industrial applic
Page 159 and 160:
eing located in alternative risk ba
Page 162 and 163:
and computer scientists, to verify
Page 164 and 165:
Baxt, W. G. (1992). Use of an Artif
Page 166 and 167:
Cristianini, N., Shawe-Taylor, J. (
Page 168 and 169:
Hawkins, R.G., Houston, M.C., Ferra
Page 170 and 171:
Kohonen, T. (1981). Self-organized
Page 172 and 173:
Manning, C.D., Raghvan, P., and Sch
Page 174 and 175:
Ohn, M. S., Van-Nam, H., and Yoshit
Page 176 and 177:
Simelius, K., Stenroos, M., Reinhar
Page 178 and 179:
Wang, K. Wang, L. Wang, D. and Xu,
Page 180 and 181:
Appendix A. Data structureA.1. Hull
Page 182 and 183:
A.2. Dundee siteThe following table
Page 184 and 185:
B.2. Scoring Risk ModelsTable B2 sh
Page 186 and 187:
The results are then divided to dif
Page 188 and 189:
Attribute NameOriginal data Transfo
Page 190 and 191:
AttributeName175Attribute TypeMissi
Page 192 and 193:
Filtering task: The expected outcom
Page 194 and 195:
Data is clustered by SOM Kmeans alg
Page 196 and 197:
197 values of “C2H”. The model
Page 198 and 199:
PATIENT_STATUS Boolean 0 Alive/Dead
Page 200 and 201:
Cleaning and Transformation tasks:
Page 202 and 203:
High riskLow riskCM3a-MLPCM3a-RBFCM
Page 204 and 205:
Table C19: Experimental results of
Page 206 and 207:
Therefore, the Mortality has got 25
Page 208 and 209:
Both models CM3aC and CM3bC are use
Page 210 and 211:
R1 PATECGAMBUL_STATUR1-A SIDEANTI_A
Page 212:
Step 5 (Building New Models): The n
show all

Predicting Cardiovascular Risks using Pattern Recognition and Data ...

Create successful ePaper yourself

Delete template?

Save as template?