Under The Hood
BY RAY SOMORJAI, PhD

Twin Curses Plague Biomedical Data Analysis

Noninvasive experimental techniques, such as magnetic resonance (MR), infrared, Raman and fluorescence spectroscopy, and more recently, mass spectroscopy (proteomics) and microarrays (genomics), have helped us better understand, diagnose and treat disease. These methods create a huge number of features, on the order of 1,000-10,000, resulting in Bellman's curse of dimensionality: too many features (i.e., dimensions). However, clinical reality frequently limits the number of available samples to the order of 10-100. This leads to the curse of dataset sparsity: too few samples. Thus, on the one hand, we have a wealth of information available for data analysis; on the other hand, statistically meaningful analysis is hampered by sample scarcity. Robust, reliable data classification (e.g., distinguishing between diseased and healthy conditions) requires a sample-to-feature ratio on the order of 5-10, instead of the initial 1/10-1/1000. What can be done?

DETAILS: Ray Somorjai is Head of Biomedical Informatics at the Institute for Biodiagnostics, National Research Council Canada. The three major thrusts in his Group are supervised classification, with special emphasis on handling the peculiarities of biomedical data; unsupervised classification (e.g., EvIdent, a powerful fuzzy clustering software); and the mathematical modeling of the spread of infectious diseases (AIDS, SARS, etc.).

To lift the curse of dimensionality and reduce the number of features to a manageable size, we use feature extraction/selection (FES). FES reduces dimensionality by identifying and eliminating redundant or irrelevant information. For microarray data this is accomplished by first identifying groups of correlated genes and defining group averages as new features. For spectra, neighboring features are strongly correlated, and therefore the majority of features are redundant. In addition, many features are "noise," or are irrelevant for the desired classification. Eliminating these yields a much lower-dimensional feature space that suffices for accurate spectral characterization. To identify the spectral features to be eliminated, we have developed an algorithm that helps select optimal sub-regions that are most relevant for an accurate classification. Averaging adjacent spectral intensities leads to further reduction, while retaining spectral identity, which is important for interpretability of the resulting features (e.g., MR peaks, essentially averages of adjacent spectral intensities, are manifestations of the presence of specific chemical compounds).

Dataset sparsity has more subtle consequences, and lifting this curse is more problematic. The ideal solution—acquiring more samples—is frequently too expensive or even impracticable. Yet, limited sample size may create classifiers that give overoptimistic accuracies, even after feature space reduction. Robust classifier creation requires enough samples to meaningfully partition the data into training, validation and independent test sets. The training set is used for both FES and optimal classifier development. The validation set helps prevent the classifier from adapting to the peculiar-

continued on page 25
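The two FES reduction steps described above lend themselves to a short illustration. The sketch below is a minimal Python rendering, not the Group's actual algorithm: the hierarchical clustering of correlated genes, the correlation cutoff, and the bin width are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def group_correlated_genes(X, corr_threshold=0.8):
    """Microarray FES: replace each group of correlated genes by its mean.

    X: (n_samples, n_genes) expression matrix.
    corr_threshold: assumed cutoff; genes correlated above it are merged.
    """
    dist = pdist(X.T, metric="correlation")      # 1 - r between gene columns
    tree = linkage(dist, method="average")       # assumed clustering choice
    labels = fcluster(tree, t=1.0 - corr_threshold, criterion="distance")
    return np.column_stack(
        [X[:, labels == k].mean(axis=1) for k in np.unique(labels)]
    )

def bin_spectra(X, bin_width=4):
    """Spectral FES: average adjacent intensities in contiguous sub-regions,
    so each new feature still maps onto a spectral position (e.g., an MR peak).

    X: (n_samples, n_channels) spectra; bin_width is an assumed value.
    """
    n_samples, n_channels = X.shape
    n_bins = n_channels // bin_width
    return (X[:, : n_bins * bin_width]           # drop leftover channels
            .reshape(n_samples, n_bins, bin_width)
            .mean(axis=2))
```

With, say, 10,000 spectral channels and bin_width=4, bin_spectra alone cuts the dimensionality fourfold before any sub-region selection is applied.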
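The training/validation/test partition in the final paragraph can likewise be sketched. scikit-learn and the split proportions are assumptions of this illustration; the point it demonstrates is that the test set is held out first, so it influences neither FES nor classifier development.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.2, test_size=0.2, seed=0):
    """Partition (X, y) into training, validation and independent test sets.
    Proportions and stratification are illustrative assumptions."""
    # Hold out the independent test set first: it sees neither FES
    # nor classifier development.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    # Carve the validation set out of the remaining development data.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_dev, y_dev, test_size=rel_val, stratify=y_dev, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```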
