
Sample classification using the GA/KNN algorithm with varying values of the p parameter

By E. Prabhu Raman

George Mason University

Abstract: The GA/KNN method developed by Leping Li [1] has been tested and shown to classify ALL/AML samples efficiently. This study explores the results obtained after modifying the KNN step of the GA/KNN method by varying the Minkowski p parameter. In the runs performed, the GA/KNN method did not classify the test samples efficiently: out of 34 test samples, 12 to 14 were misclassified in each run. The odd values of p (3, 5, 7) all gave the same classification result, in which every sample was classified as ALL. Only the even values of p (2, 4, 6) classified some of the AML samples in the test data correctly.

Introduction: With the invention of microarray technology, it is possible to measure the expression of thousands of genes at a time. A microarray image gives a snapshot of the activity of each gene of a cell at the time of mRNA extraction.

Hence, we can develop expression signatures for cells, by which we can identify an unknown type of cell. This approach is applied here to identify tumor samples. First, a signature is constructed using a training set. Then, using that signature, samples are classified into different classes.

This study deals with leukemia data (Golub et al., 1999 [3]). There are 72 samples belonging to two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Subclasses exist within these two classes.

The GA/KNN algorithm works on subsets of genes whose size is specified by the user. The genetic algorithm (GA) is used to choose relatively few discriminative subsets of genes [1]. KNN is used to decide whether a subset is discriminative or not. For a detailed description, please refer to Li et al. [1].

The p parameter appears in the distance formula used in the KNN calculation. The distance is calculated between n-dimensional data points (samples), n being equal to the number of genes. If $S_1, S_2, \dots, S_m$ are the m samples, and $g_{1k}, g_{2k}, \dots, g_{nk}$ are the n expression values of the n genes of sample $S_k$, then using a p parameter of value 2, we would calculate the distance between any two samples $S_j$ and $S_k$ as:

$$D = \left[ (g_{1k} - g_{1j})^{2} + \dots + (g_{nk} - g_{nj})^{2} \right]^{1/2}$$

Similarly, if the p parameter equals 5, then the formula will be as follows:

$$D = \left[ (g_{1k} - g_{1j})^{5} + \dots + (g_{nk} - g_{nj})^{5} \right]^{1/5}$$

Methods:

Data: The Golub data was obtained from the BINF 733 website. The data was filtered and transformed to log scale as described below (procedure taken from Li et al. [1]).

The filtering excluded genes whose expression levels were below 50 in more than 57 of the 72 samples. This left 5455 genes, whose values were transformed to the log scale. The data set was then formatted to serve as input for the GA/KNN algorithm.
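The filtering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_and_log`, the genes-by-samples array layout, the use of base-10 logarithms and the clipping of values to a floor of 1 before taking the log are all choices made for this sketch, not details given in the text.

```python
import numpy as np

def filter_and_log(expr, low=50.0, max_low_samples=57, floor=1.0):
    """Filter genes and log-transform the survivors.

    expr: array of shape (n_genes, n_samples) with raw expression levels.
    A gene is excluded when its level is below `low` in more than
    `max_low_samples` samples, as described in the text.
    """
    expr = np.asarray(expr, dtype=float)
    n_low = (expr < low).sum(axis=1)      # samples below the threshold, per gene
    keep = n_low <= max_low_samples       # boolean mask of retained genes
    # Assumption: clip to a positive floor so the logarithm is defined.
    kept = np.clip(expr[keep], floor, None)
    # Assumption: base-10 logarithm; the text only says "log scale".
    return np.log10(kept), keep

# Toy example with 3 genes and 4 samples, allowing at most 2 low samples.
toy = np.array([
    [10.0, 10.0, 10.0, 100.0],    # low in 3 samples: excluded
    [100.0, 1000.0, 10.0, 10.0],  # low in 2 samples: kept
    [100.0, 100.0, 100.0, 100.0], # never low: kept
])
logged, keep = filter_and_log(toy, max_low_samples=2)
print(keep, logged.shape)
```

With the defaults (`low=50`, `max_low_samples=57`) the same function expresses the rule applied to the 72 samples.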

Usage of GA/KNN algorithm:

The GA/KNN source code was kindly provided by Dr. Leping Li. First, it was run on the dataset without any modification. In that case the p parameter was equal to 2. Then the code was modified by changing the distance calculation used when finding the K-nearest neighbors, and the modified source code was run for p parameter values 3, 4, 5, 6 and 7. Higher values of p increased the run time, and the program did not execute successfully for them because of limited computing resources.
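As a rough illustration of how p enters the nearest-neighbor step, the sketch below classifies one sample by KNN with a configurable p. It is not the GA/KNN source code: the names `minkowski_distance` and `knn_classify` are hypothetical, and a simple majority vote is assumed. The sketch also takes absolute differences, as in the standard Minkowski definition. This coincides with the formulas given earlier when p is even; whether the modified GA/KNN code did the same for odd p is not stated.

```python
from collections import Counter

import numpy as np

def minkowski_distance(a, b, p=2):
    """Standard Minkowski distance of order p (uses absolute differences)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def knn_classify(train_X, train_y, sample, k=3, p=2):
    """Label `sample` by majority vote among its k nearest training samples.

    Assumption: majority voting; the voting rule of the original
    GA/KNN code is not described in this document.
    """
    dists = [minkowski_distance(row, sample, p) for row in train_X]
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups standing in for ALL and AML.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
for p in (2, 3, 4):
    print(p, knn_classify(train_X, train_y, [0.2, 0.1], k=3, p=p))
```

Changing `p` in the call is the only modification needed to repeat a run with a different distance.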

The program was run several times on a Unix server (mason.gmu.edu) to find the optimal parameters, taking the run time into account. The following parameters were found best suited for the Golub data set:

Chromosome length: 10 (higher values did not change the output)

Number of near-optimal solutions: 5000 (higher values did not change the output)

Termination fitness cutoff: 36

K (in KNN): 3

Number of training samples: 38

Number of test samples: 34

For a detailed description of each of these parameters, please refer to Li et al. [1].

The results from the algorithm were analyzed and the number of wrong classifications was counted for each run (that is, for each value of the p parameter). Each run of the program yielded a gene ranking. The 50 top-ranked genes were taken in each case, reducing the dataset from 5455 genes across 72 samples to 50 genes across 72 samples. This reduced data was then plotted on its first two principal components using MATLAB.
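The reduction and projection step can be sketched as follows. The study used MATLAB; this Python version is a stand-in for illustration only, and the function name `project_top_genes`, the samples-by-genes layout and the `ranked_genes` input (gene indices ordered best first) are assumptions.

```python
import numpy as np

def project_top_genes(data, ranked_genes, n_top=50):
    """Project samples onto the first two principal components.

    data: array of shape (n_samples, n_genes).
    ranked_genes: gene (column) indices ordered from best to worst rank.
    Returns an array of shape (n_samples, 2) with the PC1 and PC2 scores.
    """
    X = np.asarray(data, dtype=float)[:, list(ranked_genes)[:n_top]]
    Xc = X - X.mean(axis=0)                        # center each gene
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

# Random stand-in for the expression matrix (72 samples, 200 genes).
rng = np.random.default_rng(0)
data = rng.normal(size=(72, 200))
ranked = list(range(200))                          # placeholder ranking
scores = project_top_genes(data, ranked, n_top=50)
print(scores.shape)
```

The two columns of `scores` are the coordinates that would be plotted against each other, one point per sample.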


Results:

The GA/KNN algorithm did not classify the Golub data well in this study. Between 12 and 14 of the 34 test samples were misclassified in each run.

p    ALL misclassifications    AML misclassifications    Total misclassifications (out of 34)

2    6                         7                         13

3    0                         14                        14

4    5                         7                         12

5    0                         14                        14

6    7                         7                         14

7    0                         14                        14
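The totals in the table can be cross-checked with a short tally. The numbers below are copied from the table; the variable and function names are only illustrative.

```python
# (p, ALL misclassifications, AML misclassifications), copied from the table
results = [(2, 6, 7), (3, 0, 14), (4, 5, 7), (5, 0, 14), (6, 7, 7), (7, 0, 14)]
N_TEST = 34  # number of test samples

def summarize(rows, n_test=N_TEST):
    """Return wrong/correct counts and accuracy for each value of p."""
    summary = {}
    for p, all_wrong, aml_wrong in rows:
        wrong = all_wrong + aml_wrong
        summary[p] = {
            "wrong": wrong,
            "correct": n_test - wrong,
            "accuracy": (n_test - wrong) / n_test,
        }
    return summary

summary = summarize(results)
best_p = min(summary, key=lambda p: summary[p]["wrong"])
for p, row in summary.items():
    print(f"p={p}: {row['correct']}/{N_TEST} correct ({row['accuracy']:.1%})")
print("fewest misclassifications at p =", best_p)
```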

For p parameter values 3, 5 and 7 the results are identical. The results for the even-numbered p values also resemble one another, with little difference between their outputs. The striking feature is that the odd-numbered p values fail to classify any AML sample correctly. A p parameter of 4 seems to be the best choice for this dataset, as it classifies 22 of the 34 samples correctly. Please refer to the legend titled "The PCA graphs" for explanations of the PCA graphs shown below.

[Figure: PCA plots of the first two principal components, one panel each for p = 2, 3, 4, 5, 6 and 7]

The PCA plots for the even-numbered p runs (except p = 6) seem to distinguish ALL and AML better: the ALL samples are concentrated towards one end and the AML samples towards the other. In the odd-numbered p runs, the ALL samples lie at the centre, are not as well separated, and are closer to the AML samples than in the even-numbered p runs. This agrees with the number of misclassifications counted for the odd-numbered p runs.

Conclusions<br />

The GA/KNN algorithm did not perform well in this study. It misclassified 12 to 14 of the 34 test samples. The output of the GA/KNN algorithm was observed to change with the p parameter. The even-numbered p values seem to perform better than the odd-numbered ones. The odd values fail to classify any AML sample correctly, and the reason for this remains unexplored.

References

1. Computational Analysis of Leukemia Microarray Expression Data using the GA/KNN Method, by Leping Li, Lee G. Pedersen, Thomas A. Darden, and Clarice R. Weinberg.

2. A Tutorial on Principal Component Analysis, by Lindsay I. Smith. Feb 26, 2002.

3. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, by Golub et al. 1999.

The PCA graphs: For each value of p, the algorithm was run and PCA was done using the top 50 genes obtained.

Green circles: B-cell ALL

Green squares: T-cell ALL

Red circles: M1 type AML

Red squares: M2 type AML

Red diamonds: M4 type AML

Red stars: M5 type AML

