Under the Hood
BY RAY SOMORJAI, PhD

Twin Curses Plague Biomedical Data Analysis

DETAILS
Ray Somorjai is Head of Biomedical Informatics at the Institute for Biodiagnostics, National Research Council Canada. The three major thrusts in his Group are supervised classification, with special emphasis on handling the peculiarities of biomedical data, unsupervised classification (e.g., EvIdent, a powerful fuzzy clustering software) and the mathematical modeling of the spread of infectious diseases (AIDS, SARS, etc.).

Noninvasive experimental techniques, such as magnetic resonance (MR), infrared, Raman and fluorescence spectroscopy, and more recently, mass spectroscopy (proteomics) and microarrays (genomics), have helped us better understand, diagnose and treat disease. These methods create a huge number of features, on the order of 1,000-10,000, resulting in Bellman’s curse of dimensionality: too many features (i.e., dimensions). However, clinical reality frequently limits the number of available samples to the order of 10-100. This leads to the curse of dataset sparsity: too few samples. Thus, on the one hand, we have a wealth of information available for data analysis; on the other hand, statistically meaningful analysis is hampered by sample scarcity. Robust, reliable data classification (e.g., distinguishing between diseased and healthy conditions) requires a sample-to-feature ratio on the order of 5-10, instead of the initial 1/10-1/1000. What can be done?

To lift the curse of dimensionality and reduce the number of features to a manageable size, we use feature extraction/selection (FES). FES reduces dimensionality by identifying and eliminating redundant or irrelevant information. For microarray data this is accomplished by first identifying groups of correlated genes and defining group averages as new features. For spectra, neighboring features are strongly correlated, and therefore the majority of features are redundant.
In addition, many features are “noise,” or are irrelevant for the desired classification. Eliminating these yields a much lower-dimensional feature space that suffices for accurate spectral characterization. To identify the spectral features to be eliminated, we have developed an algorithm that helps select optimal sub-regions that are most relevant for an accurate classification. Averaging adjacent spectral intensities leads to further reduction, while retaining spectral identity, which is important for interpretability of the resulting features (e.g., MR peaks, essentially averages of adjacent spectral intensities, are manifestations of the presence of specific chemical compounds).

Dataset sparsity has more subtle consequences, and lifting this curse is more problematic. The ideal solution, acquiring more samples, is frequently too expensive or even impracticable. Yet, limited sample size may create classifiers that give overoptimistic accuracies, even after feature space reduction. Robust classifier creation requires enough samples to meaningfully partition the data into training, validation and independent test sets. The training set is used for both FES and optimal classifier development. The validation set helps prevent the classifier from adapting to the peculiarities of a finite training set (overfitting) by monitoring the progress of the FES/classifier. The independent test set is used for external cross-validation, but only after completion of the FES and identification of the final classifier. With small datasets, even partitioning into training and test sets is statistically suspect, and k-fold cross-validation is used: the dataset is split into k equal parts (k ~ 5-10), and a classifier is trained on k-1 parts and tested on the remaining part. One then cycles through k times and averages the test results. For small sample sizes, the variance of the averaged test accuracies tends to be unacceptably large, while overtraining is still a threat.

For highly imbalanced classes (e.g., rare disease vs. healthy), overall classification accuracy can be misleading. For example, consider 90 samples in the healthy class, but only 10 in the disease class. Misclassifying all 10 still gives 90% overall accuracy. Hence, balanced sensitivity and specificity (i.e., comparable accuracies for both classes) is more appropriate, and can be achieved by undersampling, oversampling or by penalizing misclassifications differently for different classes. (Differing misclassification costs for the classes is an example.)

For each sample, we compute class probabilities. This is relevant clinically (e.g., additional tests would be suggested if a classifier assigned a patient to the disease class with 55% probability, whereas immediate treatment would commence if this probability were 90%).

In the biomedical field, the twin curses are generally active. They both must be dealt with in concert; otherwise overly optimistic and frequently wrong conclusions will result.
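The 90/10 example above can be worked through in a few lines of Python. This is only a minimal sketch under stated assumptions, not the author's method: the "classifier" is a toy that always predicts the most common training label, the helper names (kfold_indices, majority_label, sensitivity_specificity) are hypothetical, and k = 10 is chosen from the 5-10 range mentioned in the text. It runs k-fold cross-validation on 90 healthy and 10 diseased labels and reports overall accuracy alongside the per-class accuracies.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_label(labels):
    """Toy stand-in for a classifier (an assumption for illustration):
    always predict the most common label seen in training."""
    return max(set(labels), key=labels.count)

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Per-class accuracies: sensitivity (disease class) and
    specificity (healthy class)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg

# 90 healthy (0) and 10 diseased (1) samples, as in the article's example.
labels = [0] * 90 + [1] * 10
folds = kfold_indices(len(labels), k=10)

fold_accuracies = []
predictions = [None] * len(labels)
for test_idx in folds:
    held_out = set(test_idx)
    # "Train" on the k-1 parts that are not held out.
    train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
    guess = majority_label(train_labels)
    # Test on the remaining part.
    for i in test_idx:
        predictions[i] = guess
    correct = sum(labels[i] == guess for i in test_idx)
    fold_accuracies.append(correct / len(test_idx))

# Average the test results over the k cycles.
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
sensitivity, specificity = sensitivity_specificity(labels, predictions)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"mean test accuracy: {mean_accuracy:.2f}")
print(f"sensitivity:        {sensitivity:.2f}")
print(f"specificity:        {specificity:.2f}")
print(f"balanced accuracy:  {balanced_accuracy:.2f}")
```

With these toy labels the averaged test accuracy comes out at 0.90 even though sensitivity is 0.00 (every diseased sample is missed) and the balanced accuracy is only 0.50. This is the situation the text warns about, and the reason for reporting sensitivity and specificity separately.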
Seeing Science (continued)

With a name inspired by Friedrich Nietzsche’s Ecce Homo, a meditation on how one becomes what one is, the project explores human evolution by examining similarities between genes from human beings and a target organism, in this case the rice plant. Ecce Homology is a physically interactive new-media work that visualizes genetic data as calligraphic forms. A novel computer-vision-based interface allows multiple participants, through their movement in the installation space, to select genes from the human genome for visualization using the Basic Local Alignment Search Tool (BLAST). Five projectors present these changes in Ecce Homology’s calligraphic forms across a 40-foot wide wall.

“If we worked on the genomic calligraphy visualization further, it could be useful to scientists,” she says, “but the installation is not a tool; it’s art. And it’s specifically ambiguous and a bit mysterious, by intention.”

Ecce Homology, which was first displayed two years ago at the Fowler Museum in Los Angeles, works on many levels both scientifically and artistically. “People assume that there’s value in the vast amounts of genomic data we are generating,” says West, “but data is not knowledge, and in order for us to derive knowledge from it, we need to interpret it. The more complex it is, the harder it is for human beings to do that and, consequently, the greater our need to find new approaches.” So, says West, “we’ve produced an artwork that both speaks to this need and lets viewers interact fluidly with the data in a visceral way.”

Ultimately, West says, the exhibit poses the question, “If you were to do work that’s truly hybrid art/science, what would that process be like? And would there be any outcome that would point to how art might nurture scientific discovery?”

For more information about Ecce Homology, visit www.insilicov1.org.
Ecce Homology’s custom software transforms strings of genetic code into luminous, scientifically accurate visualizations that incorporate multiple biological features. For protein sequences, the stroke placement, shape and brush quality are determined by physical and chemical properties, such as the proportion of mass to volume, hydrophobicity, or ionization of the amino acids. The visualization is created from amino-acid sequence chunks that are segmented by a “turn prediction” algorithm. Each segment’s corresponding calligraphic stroke is connected to its neighbor by a connection whose shape is based on a secondary structure property of the segment. The result resembles calligraphy. Courtesy: Ruth West

BIOMEDICAL COMPUTATION REVIEW, Fall 2005, www.biomedicalcomputationreview.org