
Sample classification using the GA/KNN algorithm with varying values of the p parameter

By 

E. Prabhu Raman 

George Mason University 

Abstract : The GA/KNN method developed by Leping Li [1] has been tested and shown to classify ALL/AML samples efficiently. This study explores the results obtained after modifying the KNN step of the GA/KNN method by varying the Minkowski p parameter. In the runs performed, the GA/KNN method did not classify the test samples efficiently: out of 34 test samples, 12 to 14 were misclassified in each run. The odd-numbered p values (3, 5, 7) all gave the same classification result, labelling every sample as ALL. Only the even-numbered p values (2, 4, 6) classified some of the AML samples in the test data correctly.

Introduction : With the invention of microarray technology, it is possible to measure 

the expression of thousands of genes at a time. A microarray image gives a snapshot of 

the activity of each gene of a cell at the time of mRNA extraction. 

Hence, we can develop expression signatures for cells, by which a cell of unknown type can be identified. This approach is applied here to identify tumor samples. First, a signature is constructed using a training set. Then, using that signature, samples are classified into different classes.

This study deals with leukemia data (Golub et al., 1999 [3]). There are 72 samples comprising two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Subclasses exist within these two classes.

The GA/KNN algorithm constructs subsets of genes, with the subset size specified by the user. The genetic algorithm (GA) is used to choose relatively few discriminative subsets of genes [1]. The k-nearest neighbors (KNN) method is used to decide whether a subset is discriminative or not. For a detailed description, please refer to Li et al. [1]

The p parameter appears in the distance formula used in the KNN calculation. The distance is calculated between n-dimensional data points (samples), n being equal to the number of genes. Let $S_1, S_2, \ldots, S_m$ be the m samples, and let $g_{1k}, g_{2k}, \ldots, g_{nk}$ be the expression values of the n genes of sample $S_k$. Then, using a p parameter of value 2, we would calculate the distance between any two samples $S_j$ and $S_k$ as:

$$D = \left[ (g_{1k} - g_{1j})^{2} + \cdots + (g_{nk} - g_{nj})^{2} \right]^{1/2}$$

Similarly, if the p parameter equals 5, the formula is:

$$D = \left[ (g_{1k} - g_{1j})^{5} + \cdots + (g_{nk} - g_{nj})^{5} \right]^{1/5}$$

Methods : 

Data: The Golub data were obtained from the BINF 733 website. The data were filtered and transformed to the log scale as described below (procedure taken from Li et al. [1]).

The filtering excluded genes whose expression levels were below 50 in more than 57 of the 72 samples. After filtering, 5455 genes remained, and these were transformed to the log scale. The data set was then formatted to serve as input for the GA/KNN algorithm.
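The filtering rule above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the natural logarithm is assumed because the text does not give the base, and flooring values before the log is an assumption made so that non-positive intensities do not raise an error.

```python
import math

def filter_genes(expr, threshold=50.0, max_low_samples=57):
    """Return the indices of the genes that pass the filter.

    expr holds one row per gene, each row containing one expression
    value per sample. A gene is dropped when its value is below
    `threshold` in more than `max_low_samples` samples.
    """
    kept = []
    for i, row in enumerate(expr):
        n_low = sum(1 for v in row if v < threshold)
        if n_low <= max_low_samples:
            kept.append(i)
    return kept

def log_transform(row, floor=1.0):
    # Assumption: values are floored at `floor` before taking the
    # natural log; the text does not specify the base or any flooring.
    return [math.log(max(v, floor)) for v in row]
```

With 72 samples, a gene that is below 50 in exactly 57 samples is kept, while one that is below 50 in 58 samples is dropped.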

Usage of GA/KNN algorithm: 

The GA/KNN source code was kindly provided by Dr. Leping Li. First, it was run on the dataset without any modification; in that case the p parameter was equal to 2. Then the code was modified by changing the distance calculation used when finding the K-nearest neighbors. The modified source code was run for p parameter values 3, 4, 5, 6 and 7. Higher p values increased the run time of the program, and the program did not execute successfully because of limited computing resources.
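As an illustration of the kind of change described, a distance function parameterized by p might look like the sketch below. It is not the GA/KNN source code. The function name is hypothetical, and the sketch follows the standard Minkowski definition, which takes absolute differences; for even p this coincides with the formulas given in the Introduction, while for odd p the two differ whenever a difference is negative. Whether the modified GA/KNN code took absolute values is not stated in this report.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of genes")
    # Standard definition: sum of absolute differences raised to p,
    # followed by the p-th root.
    total = sum(abs(x - y) ** p for x, y in zip(a, b))
    return total ** (1.0 / p)
```

With p = 2 this reduces to the ordinary Euclidean distance, so two samples with expression values (0, 0) and (3, 4) are at distance 5.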

The program was run several times on a Unix server (mason.gmu.edu) to find the optimal parameters, taking the run time into account. The following parameters were found to be best suited to the Golub data set:

Chromosome length : 10 (higher values did not change the output)
Number of near-optimal solutions : 5000 (higher values did not change the output)
Termination fitness cutoff : 36
K (in KNN) : 3
Number of training samples : 38
Number of test samples : 34

For a detailed description of each of these parameters, please refer to Li et al. [1].
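To make the role of K and p concrete, a KNN classifier with K = 3 can be sketched as below. This is a minimal illustration, not the GA/KNN implementation: the function name is hypothetical, a simple majority vote is assumed (the GA/KNN code may use a different voting rule), and absolute differences are used in the distance.

```python
from collections import Counter

def knn_classify(train_x, train_y, query, k=3, p=2):
    """Label `query` by majority vote among its k nearest training samples."""
    dists = []
    for x, label in zip(train_x, train_y):
        # Minkowski distance of order p (absolute differences assumed).
        d = sum(abs(a - b) ** p for a, b in zip(x, query)) ** (1.0 / p)
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

With two classes and an odd K such as 3, a majority vote can never end in a tie.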

The results from the algorithm were analyzed, and the number of wrong classifications was counted for each run (one run per value of the p parameter). Each run of the program yielded a gene ranking. The 50 top-ranked genes were taken in each case, reducing the dataset from 5455 genes across 72 samples to 50 genes across 72 samples. These data were then plotted on their first two principal components using MATLAB.
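The projection onto the first two principal components can be sketched as follows. The study used MATLAB; the Python/NumPy version below is only a stand-in to show the computation, and the function name and the samples-by-genes array layout are assumptions.

```python
import numpy as np

def first_two_pcs(data):
    """Project samples onto their first two principal components.

    data: array of shape (n_samples, n_genes), e.g. (72, 50).
    Returns an array of shape (n_samples, 2).
    """
    x = np.asarray(data, dtype=float)
    # Center each gene (column) at zero mean before the decomposition.
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

The two returned columns are the coordinates that would be drawn on the horizontal and vertical axes of each PCA plot.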

Results : 

The GA/KNN method did not work well in this study for the Golub data: 12 to 14 samples out of 34 were misclassified in each run.

p    ALL misclassifications    AML misclassifications    Total misclassifications
2    6                         7                         13 out of 34
3    0                         14                        14 out of 34
4    5                         7                         12 out of 34
5    0                         14                        14 out of 34
6    7                         7                         14 out of 34
7    0                         14                        14 out of 34
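The totals in the table can be checked, and converted to counts of correct classifications, with a short tally. The counts are copied from the table; the variable and function names are hypothetical.

```python
# p value -> (ALL misclassifications, AML misclassifications), from the table.
results = {2: (6, 7), 3: (0, 14), 4: (5, 7), 5: (0, 14), 6: (7, 7), 7: (0, 14)}
N_TEST = 34

def summarize(results, n_test=N_TEST):
    """Return p -> (total errors, correct count, fraction correct)."""
    summary = {}
    for p, (all_err, aml_err) in results.items():
        total = all_err + aml_err
        summary[p] = (total, n_test - total, (n_test - total) / n_test)
    return summary
```

For p = 4 this gives 12 errors, that is 22 of 34 samples (about 65%) classified correctly.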

For p values 3, 5 and 7 the results are identical. The results for the even-numbered p values are likewise similar to one another. The striking feature is that the odd-numbered p values fail to classify any AML sample correctly. A p value of 4 seems to be the best choice for this dataset, as it classifies 22 of the 34 samples correctly, although there is not much difference between the outputs of the even-numbered p values. Please refer to the next page for explanations of the PCA graphs shown below.

[Figures: PCA plots for p = 2, p = 3, p = 4, p = 5, p = 6 and p = 7]

The PCA plots for the even-numbered p runs (except p = 6) seem to distinguish ALL and AML better: the ALL samples are concentrated towards one end and the AML samples towards the other. In the odd-numbered p runs, the ALL samples lie at the centre and are not as well separated, so the two classes are not distinguished as well. The ALL samples are closer to the AML samples than in the even-numbered p runs. This agrees with the number of misclassifications counted for the odd-numbered p runs.

Conclusions 

The GA/KNN algorithm did not work well in this study: it misclassified 12 to 14 of the 34 test samples. The output of the GA/KNN algorithm was observed to change with the p parameter. The even-numbered p values seem to perform better than the odd-numbered ones. The odd-numbered values fail to classify any AML sample correctly; the reason for this remains unexplored.

The PCA graphs: 

For each value of p, the algorithm was run and PCA was done using the top 50 genes obtained.

Green circles - B-cell ALL
Green squares - T-cell ALL
Red circles - M1 type AML
Red squares - M2 type AML
Red diamonds - M4 type AML
Red stars - M5 type AML

References 

1. Leping Li, Lee G. Pedersen, Thomas A. Darden, and Clarice R. Weinberg. Computational Analysis of Leukemia Microarray Expression Data using the GA/KNN Method.

2. Lindsay I Smith. A tutorial on Principal Component Analysis. Feb 26, 2002.

3. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. 1999.
