02.07.2013 Views

Sample classification using the GA/KNN algorithm with varying ...


Similarly, if the p parameter equals 5, the formula will be as follows:

$$D = \left[ (g_{1k} - g_{1j})^5 + \dots + (g_{nk} - g_{nj})^5 \right]^{1/5}$$

Methods:

Data: The Golub data set was obtained from the BINF 733 website. The data were filtered and transformed to the log scale, following Li et al.¹ Filtering excluded genes whose expression levels were below 50 in more than 57 of the 72 samples. The 5455 genes that remained after filtering were transformed to the log scale. The data set was then formatted to serve as input for the GA/KNN algorithm.
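The filtering rule above can be sketched in a few lines of Python. This is only an illustration, not the script used for the paper: the function name `filter_and_log`, the dictionary layout, the choice of the natural logarithm, and the clamping of values below 1 (so that the logarithm is defined) are all assumptions, since the text does not specify them.

```python
import math

def filter_and_log(expr, floor=50.0, max_low=57):
    """Drop genes that are below `floor` in more than `max_low` samples,
    then log-transform the genes that remain.

    expr maps a gene name to its list of expression levels, one per sample.
    """
    kept = {}
    for gene, levels in expr.items():
        n_low = sum(1 for v in levels if v < floor)
        if n_low > max_low:
            continue  # expressed too weakly in too many samples
        # Assumption: natural log, with values clamped to 1 so that
        # zero or negative readings do not break the logarithm.
        kept[gene] = [math.log(max(v, 1.0)) for v in levels]
    return kept
```

With 72 samples, a gene that is below 50 in exactly 57 samples is kept, while one that is below 50 in 58 or more is dropped.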

Usage of the GA/KNN algorithm:

The GA/KNN source code was kindly provided by Dr. Leping Li. First, it was run on the data set without any modification; in that case the p parameter was equal to 2. The code was then modified by changing the distance calculation used to find the K nearest neighbors, and the modified source code was run for p parameter values of 3, 4, 5, 6 and 7. Higher p values increased the run time of the program, and those runs did not complete successfully because of limited computing resources.
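To make the role of the p parameter concrete, here is a minimal Python sketch of a K-nearest-neighbor classifier with an adjustable distance exponent. It is not the GA/KNN source code: the function names are hypothetical, the absolute value inside the distance is an assumption (the standard Minkowski form, which keeps the sum non-negative for odd p), and the majority vote is a simplification that may differ from the rule used in GA/KNN.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Distance of order p between two expression profiles.

    Assumption: absolute differences are used, so odd values of p
    cannot produce a negative sum. p=2 gives the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train, labels, query, k=3, p=2):
    """Label `query` by majority vote among its k nearest training samples."""
    order = sorted(range(len(train)),
                   key=lambda i: minkowski(train[i], query, p))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Changing `p` from 2 to a value between 3 and 7 alters only the distance, which mirrors the modification described above; larger exponents give more weight to the genes with the largest differences.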

The program was run several times on a Unix server (mason.gmu.edu) to find the optimal parameters, taking the run time into account. The following parameters were found to be best suited for the Golub data set:

Chromosome length: 10 (higher values did not change the output)

Number of near-optimal solutions: 5000 (higher values did not change the output)

Termination fitness cutoff: 36

K (in KNN): 3

Number of training samples: 38

Number of test samples: 34

For a detailed description of each of these parameters, please refer to Li et al.¹

The results from the algorithm were analyzed, and the number of wrong classifications was counted for each run (that is, for each value of the p parameter). Each run of the program yielded a gene ranking. The 50 top-ranked genes were taken in each case, which reduced the data set from 5455 genes across 72 samples to 50 genes across 72 samples. These reduced data were then plotted on their first two principal components using MATLAB.
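The reduction to the top-ranked genes and the projection onto two principal components can be illustrated as follows. The analysis in the paper was done in MATLAB; this dependency-free Python sketch is a stand-in, and its function names, its data layout (genes as rows, samples as columns), and its use of power iteration with deflation to find the two leading components are all assumptions made for illustration.

```python
import math

def top_genes(ranking, expr, n=50):
    """Return the expression rows of the n best-ranked genes, in rank order."""
    return [expr[g] for g in ranking[:n]]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _leading_eigvec(m, iters=500):
    """Approximate the dominant eigenvector of a symmetric matrix."""
    v = [1.0 + 0.1 * i for i in range(len(m))]  # fixed, non-constant start
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [_dot(row, v) for row in m]
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:
            break  # nothing left to converge to
        v = [x / norm for x in w]
    return v

def first_two_pcs(rows):
    """rows: genes x samples. Returns one (pc1, pc2) score pair per sample."""
    n_samples = len(rows[0])
    centered = []
    for r in rows:
        mean = sum(r) / n_samples
        centered.append([x - mean for x in r])
    # Gene-by-gene sample covariance matrix.
    cov = [[_dot(ri, rj) / (n_samples - 1) for rj in centered]
           for ri in centered]
    v1 = _leading_eigvec(cov)
    lam1 = _dot(v1, [_dot(row, v1) for row in cov])
    # Deflation: remove the first component, then repeat for the second.
    size = len(cov)
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(size)]
                for i in range(size)]
    v2 = _leading_eigvec(deflated)
    scores = []
    for s in range(n_samples):
        col = [r[s] for r in centered]
        scores.append((_dot(v1, col), _dot(v2, col)))
    return scores
```

Each returned pair gives the coordinates of one sample in a two-dimensional scatter plot. The sign of a principal component is arbitrary, so a plot produced this way may be mirrored relative to one from another tool.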
