Sample classification using the GA/KNN algorithm with varying ...
Similarly, if the p parameter equals 5, then the formula will be as follows:

$$D = \left[ (g_{1k} - g_{1j})^{5} + \cdots + (g_{nk} - g_{nj})^{5} \right]^{1/5}$$

Methods:
Data: The Golub data was obtained from the BINF 733 website. The data was filtered and
transformed to the log scale as described below (procedure taken from Li et al. [1]).
Genes whose expression levels were below 50 in more than 57 of the 72 samples were
excluded. The 5455 genes that remained after filtering were transformed to the log
scale. The data set was then formatted to serve as input for the GA/KNN algorithm.
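The filtering and transformation step can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function name `filter_and_log`, the dictionary layout, and the choice of base-10 logarithm are assumptions, and the sketch assumes strictly positive expression values because the text does not say how non-positive values were handled.

```python
import math

def filter_and_log(expression, threshold=50.0, max_low_samples=57):
    """Drop genes that are below `threshold` in more than `max_low_samples`
    samples, then log-transform the values of the retained genes.

    `expression` maps a gene name to its list of per-sample expression levels
    (a hypothetical layout chosen for this sketch).
    """
    kept = {}
    for gene, values in expression.items():
        n_low = sum(1 for v in values if v < threshold)
        if n_low > max_low_samples:
            continue  # gene is excluded by the filter
        # Assumption: base-10 logarithm and strictly positive values.
        kept[gene] = [math.log10(v) for v in values]
    return kept

# Toy example with 72 samples per gene.
toy = {
    "geneA": [100.0] * 72,               # never below 50: kept
    "geneB": [10.0] * 58 + [100.0] * 14,  # below 50 in 58 samples: excluded
    "geneC": [10.0] * 57 + [100.0] * 15,  # below 50 in exactly 57 samples: kept
}
filtered = filter_and_log(toy)
print(sorted(filtered))
```

With the thresholds stated above, a gene is removed only when it is low in 58 or more of the 72 samples, which is why `geneC` survives in the toy example.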
Usage of the GA/KNN algorithm:
The GA/KNN source code was kindly provided by Dr. Leping Li. First, it was run on the
dataset without any modification; in that case the p parameter was equal to 2. The code
was then modified by changing the distance calculation used to find the K-nearest
neighbors, and the modified source code was run for p parameter values 3, 4, 5, 6 and 7.
Higher p values increased the run time of the program, and the program did not execute
successfully because of limited computing resources.
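To make the role of the p parameter concrete, the distance between two samples can be sketched in Python as below. This is an illustrative sketch and not the modified GA/KNN source: the function name `minkowski_distance` is an assumption, and the sketch uses absolute differences (the standard Minkowski form), whereas the formula given in this document writes the raw differences. For even p the two agree; for odd p the absolute value keeps every term non-negative.

```python
def minkowski_distance(x, y, p=2):
    """Distance of order p between two expression profiles x and y.

    Assumption: absolute differences are used, as in the standard
    Minkowski distance; p = 2 gives the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of genes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Distance between two toy profiles for the p values used in the text.
sample_k = [0.0, 0.0]
sample_j = [3.0, 4.0]
for p in (2, 3, 4, 5, 6, 7):
    print(p, minkowski_distance(sample_k, sample_j, p))
```

As p grows, the largest single-gene difference dominates the sum, so the distance moves toward the maximum coordinate difference (4 in this toy example).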
The program was run several times on a Unix server (mason.gmu.edu) to find the
optimal parameters, taking the run time into account. The following parameters were
found to be best suited for the Golub data set:
Chromosome length: 10 (higher values did not change the output)
Number of near-optimal solutions: 5000 (higher values did not change the output)
Termination fitness cutoff: 36
K (in KNN): 3
Number of training samples: 38
Number of test samples: 34
For a detailed description of each of these parameters, please refer to Li et al. [1].
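As a rough illustration of how K and the training/test split enter the classification, the Python sketch below classifies each test sample from its K nearest training samples and counts the test samples that receive the wrong label. It is a sketch under stated assumptions, not the GA/KNN implementation: the function names are hypothetical, a simple majority vote is assumed (the actual program may use a different voting rule), and the toy data stand in for the 38 training and 34 test samples.

```python
from collections import Counter

def minkowski_distance(x, y, p=2):
    """Order-p distance using absolute differences (standard Minkowski form)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_predict(train_x, train_y, query, k=3, p=2):
    """Label `query` by majority vote among its k nearest training samples.

    Assumption: plain majority vote; the voting rule of the real
    GA/KNN program is not described in this document.
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: minkowski_distance(train_x[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def count_misclassified(train_x, train_y, test_x, test_y, k=3, p=2):
    """Number of test samples whose predicted label differs from the true label."""
    return sum(1 for x, y in zip(test_x, test_y)
               if knn_predict(train_x, train_y, x, k, p) != y)

# Toy data: two well-separated groups of training samples.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
test_x = [[0.1, 0.1], [5.5, 5.5]]
test_y = ["ALL", "ALL"]  # second label is deliberately inconsistent with its position

for p in (2, 3, 5):
    print(p, count_misclassified(train_x, train_y, test_x, test_y, k=3, p=p))
```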
The results from the algorithm were analyzed, and the number of wrong classifications
was counted for each run (that is, for each value of the p parameter). Each run of the
program yielded a gene ranking. The 50 top-ranked genes were taken in each case, and the
dataset was reduced from 5455 genes across 72 samples to 50 genes across 72 samples.
This reduced data was then plotted on its first two principal components using MATLAB.
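The reduction to the top-ranked genes and the projection onto two principal components can be sketched in Python as follows. The study itself used MATLAB; this is only an illustrative stand-in. All function names and the data layout are assumptions, and the principal components are found here by power iteration with deflation on the covariance matrix, which is one of several ways to compute them.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_top_genes(expression, ranking, n_top=50):
    """Keep the n_top best-ranked genes.
    expression: gene name -> per-sample values; ranking: gene names, best first."""
    return {gene: expression[gene] for gene in ranking[:n_top]}

def samples_by_genes(expression):
    """Transpose gene -> samples into one row per sample."""
    return [list(row) for row in zip(*expression.values())]

def center_columns(rows):
    """Subtract each gene's mean across samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (genes x genes) of centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]

def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.random() for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return v, eigenvalue

def first_two_pcs(rows):
    """Return (PC1, PC2) scores for each sample (row)."""
    centered = center_columns(rows)
    cov = covariance(centered)
    v1, lam1 = top_eigenvector(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[c - lam1 * a * b for c, b in zip(row, v1)]
                for row, a in zip(cov, v1)]
    v2, _ = top_eigenvector(deflated, seed=1)
    return [(dot(r, v1), dot(r, v2)) for r in centered]

# Toy example: three genes, four samples; keep the two best-ranked genes.
toy = {"g1": [2.0, -2.0, 0.0, 0.0], "g2": [0.0, 0.0, 1.0, -1.0], "g3": [9.0, 9.0, 9.0, 9.0]}
reduced = select_top_genes(toy, ranking=["g1", "g2", "g3"], n_top=2)
scores = first_two_pcs(samples_by_genes(reduced))
```

The resulting `scores` list holds one (PC1, PC2) pair per sample, which is the input a scatter plot would need. The sign of each component is arbitrary, as with any PCA.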