Multivariate Pattern Classification

SPM 2010 - 13th Course on Functional Imaging (SPM Course, Hamburg 2010)

Thomas Wolbers
Space and Ageing Laboratory (www.sal.mvm.ed.ac.uk)
Centre for Cognitive and Neural Systems

Outline
• WHY PATTERN CLASSIFICATION?
• PROCESSING STREAM
• PREPROCESSING / FEATURE REDUCTION
• CLASSIFICATION
• EVALUATING RESULTS
• APPLICATIONS

Why pattern classification?
[Figure: GLM schematic. Data (time in scans) = design matrix • parameters (β0 to β10) + error]
y = X • β + ε
GLM: separate model fitting for each voxel, i.e. a mass-univariate analysis!

Why pattern classification?
Key idea behind pattern classification
• GLM analysis relies exclusively on the information contained in the time course of individual voxels.
• Multivariate analyses take advantage of the information contained in activity patterns across space, from multiple voxels.
• Cognitive/sensorimotor states are expressed in the brain as distributed patterns of brain activity.

Why pattern classification?
Advantages of multivariate pattern classification
• Increase in sensitivity: weak information in single voxels is accumulated across many voxels.
• Multiple regions/voxels may only carry information about brain states when jointly analyzed.
• Can prevent information loss due to spatial smoothing (but see Op de Beeck, 2009 / Kamitani & Sawahata, 2010).
• Can preserve temporal resolution instead of characterizing the average response across many trials.

BINOCULAR RIVALRY
Can spontaneous changes in conscious experience be decoded from fMRI signals in early visual cortex?
Haynes & Rees (2005). Current Biology

Processing stream
1. Acquire fMRI data while the subject is viewing blue and red gratings.
2. Preprocess the fMRI data.

Processing stream (continued)
3. Select relevant features (i.e. voxels).
4. Generate fMRI patterns: convert each fMRI volume into a vector that reflects the pattern of activity across voxels at that point in time.
5. Label the fMRI patterns according to whether the subject was perceiving blue vs. red (adjusting for hemodynamic lag).
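Steps 4 and 5 can be illustrated with a small sketch. It assumes one label per scan and a fixed lag expressed in whole scans; the function name `shift_labels` and the lag of 2 scans are hypothetical choices for illustration, not values from the slides.

```python
def shift_labels(labels, lag_scans):
    """Delay a per-scan label sequence by lag_scans scans.

    The first lag_scans scans have no valid label (None), because the
    BOLD response to the first percept has not yet arrived.
    """
    if lag_scans < 0:
        raise ValueError("lag_scans must be non-negative")
    n = len(labels)
    return [None] * min(lag_scans, n) + list(labels[: max(n - lag_scans, 0)])


# Hypothetical percept reported at each scan.
percept = ["blue", "blue", "red", "red", "red", "blue"]

# Assumed lag of 2 scans (depends on TR and the hemodynamic response).
shifted = shift_labels(percept, 2)

# Keep only scans that received a label; these index the pattern vectors.
labelled_scans = [(i, lab) for i, lab in enumerate(shifted) if lab is not None]
```

Each retained scan index would then be paired with that scan's voxel vector to form one labelled pattern.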

Processing stream (continued)
6. Train a classifier to discriminate between blue patterns and red patterns.
7. Apply the trained classifier to new fMRI patterns (not presented at training).
8. Crossvalidation

Processing stream (continued)
9. Statistical inference

[Figure: Haynes & Rees (2005). Current Biology]

Preprocessing
1. (Slice timing +) realignment (SPM, FSL, …)
2. High-pass filtering / detrending
   • remove linear (and quadratic) trends (e.g. scanner drift)
   • remove low-frequency artifacts (e.g. biosignals)
3. Z-scoring
   • remove baseline shifts between scanning runs
   • reduce impact of outliers

Feature Reduction
The problem
• fMRI data are typically sparse, high-dimensional and noisy
• classification is sensitive to information content in all voxels
• many uninformative voxels = poor classification (e.g. due to overfitting)
Solution 1: Feature selection
[Figure: performance as a function of the number of features]
• select the subset with the most informative features
• original features remain unchanged

Feature Selection
'External' solutions
• Anatomical regions of interest
• Independent functional localizer (Haynes & Rees: retinotopic mapping to identify early visual areas)
• Searchlight classification: define a region of interest (e.g. a sphere) and move it across the search volume (exploratory analysis)
'Internal' univariate solutions
• activation vs. baseline (t-test)
• mean difference between conditions (ANOVA)
• single-voxel classification accuracy
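The z-scoring step from the preprocessing list can be sketched as follows. This is a minimal illustration for a single voxel, assuming each scan carries a run identifier; the function name `zscore_by_run` and the toy signal are hypothetical.

```python
from statistics import mean, pstdev


def zscore_by_run(timecourse, run_ids):
    """Z-score one voxel's time course separately within each run."""
    out = [0.0] * len(timecourse)
    for run in set(run_ids):
        idx = [i for i, r in enumerate(run_ids) if r == run]
        values = [timecourse[i] for i in idx]
        mu, sd = mean(values), pstdev(values)
        for i in idx:
            # A constant run has no variance; map it to zero.
            out[i] = (timecourse[i] - mu) / sd if sd > 0 else 0.0
    return out


# Toy signal with a baseline shift of 10 units between two runs.
signal = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
runs = [0, 0, 0, 1, 1, 1]
z = zscore_by_run(signal, runs)
```

After scaling, both runs have the same values, so the baseline shift between runs no longer separates the examples.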

Feature Selection
[Figure: Pereira et al. (2009)]
Peeking #1 (ANOVA and classification only)
• testing a trained classifier needs to be performed on independent test datasets
• if the entire dataset is used for feature selection, classification estimates become overly optimistic
• solution: nested cross-validation!

Feature Extraction
Solution 1: Feature selection
• select a subset from all available features
• original features remain unchanged
Solution 2: Feature extraction
• create new features as a function of existing features
• linear functions (PCA, ICA, …)
• nonlinear functions during classification (e.g. hidden units in a neural network)
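One way to avoid Peeking #1 is to compute the selection statistic on the training examples only and then apply the chosen voxel indices to the held-out examples. The sketch below assumes binary labels 0/1 and uses the absolute difference of class means as a simple stand-in for the t-test or ANOVA named above; all function names are hypothetical.

```python
def mean_difference_scores(X, y):
    """Per-voxel absolute difference between the two class means."""
    scores = []
    for v in range(len(X[0])):
        a = [row[v] for row, lab in zip(X, y) if lab == 0]
        b = [row[v] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return scores


def select_top_k(X_train, y_train, k):
    """Indices of the k best voxels, ranked on training data only."""
    scores = mean_difference_scores(X_train, y_train)
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:k])


def keep_columns(X, cols):
    """Restrict every pattern to the selected voxels."""
    return [[row[c] for c in cols] for row in X]


# Toy fold: 4 training patterns, 1 held-out pattern, 3 voxels.
X_train = [[1.0, 5.0, 0.2], [1.1, 5.1, 0.9], [3.0, 5.0, 0.1], [3.2, 4.9, 0.8]]
y_train = [0, 0, 1, 1]
X_test = [[2.0, 5.0, 0.5]]

cols = select_top_k(X_train, y_train, k=1)   # the test fold is never inspected
X_train_sel = keep_columns(X_train, cols)
X_test_sel = keep_columns(X_test, cols)
```

In a cross-validation loop, `select_top_k` would be called again inside every fold, so the selected voxels may differ from fold to fold.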

Classification
Linear classification
[Figure: scatter plot with axes voxel 1 and voxel 2. Each point is one volume (t1, t2, t3, t4, … t25); the volumes are split into training data and independent test data; a hyperplane separates the two conditions.]
Our task: find a hyperplane that separates both conditions.

Classification
Linear classification
Decision function:
y = f(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
• if y < 0, predict red // if y > 0, predict blue
• prediction = linear function of features

Classification
Linear classification
• Project the data on a new axis that maximises the class separability.
• The hyperplane is orthogonal to the best projection axis.
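The decision function above translates almost directly into code. In this sketch the weights, offset and test patterns are made-up numbers for two voxels, and the case y = 0, which the slide leaves open, is assigned to red by assumption.

```python
def decision_value(x, w, b):
    """y = w_1*x_1 + ... + w_n*x_n + b for one pattern x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Blue if y > 0, red otherwise (y == 0 falls to red by assumption)."""
    return "blue" if decision_value(x, w, b) > 0 else "red"


# Illustrative two-voxel weight vector and offset (hypothetical values).
w = [0.45, 0.89]
b = -1.0

label_a = predict([2.0, 1.0], w, b)   # y = 0.9 + 0.89 - 1.0 > 0
label_b = predict([0.5, 0.5], w, b)   # y = 0.225 + 0.445 - 1.0 < 0
```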

Classification
Simplest approach: Fisher Linear Discriminant (FLD)
FLD classifies by projecting the training set on the axis that is defined by the difference between the centers of mass of both classes, corrected by the within-class scatter.
Separation is maximised for:
w = (m_1 - m_2) / (cov_class1 + cov_class2)

Classification
Linear classification
[Figure: scatter plot (voxel 1 vs. voxel 2) of volumes t1 … t25 with the hyperplane and the weight vector w]
y = wx + b
The hyperplane is defined by the weight vector w and the offset b.

Classification
Linear classification
How to interpret the weight vector?
[Figure: same scatter plot with hyperplane and weight vector w]
Weight vector (discriminating volume): W = [0.45 0.89]
The value of each voxel in the weight vector indicates its importance in discriminating between the two classes (i.e. cognitive states).
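For two voxels, the FLD weight vector can be computed by hand. The sketch below reads the fraction in the formula above as multiplication by the inverse of the summed class covariance matrices, which is the usual form of Fisher's discriminant; that reading, the population (divide by n) covariance and the toy data are assumptions.

```python
def mean_vector(X):
    """Center of mass of a list of two-voxel patterns."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(2)]


def covariance_2d(X):
    """2x2 population covariance matrix of two-voxel patterns."""
    m = mean_vector(X)
    n = len(X)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for row in X:
        d = [row[0] - m[0], row[1] - m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / n
    return c


def fld_weights(X1, X2):
    """w = (cov_class1 + cov_class2)^-1 (m_1 - m_2) for two voxels."""
    m1, m2 = mean_vector(X1), mean_vector(X2)
    c1, c2 = covariance_2d(X1), covariance_2d(X2)
    s = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("summed covariance matrix is singular")
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Toy classes: same spread, centers at (2.5, 2.5) and (0.5, 0.5).
class1 = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
class2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = fld_weights(class1, class2)
```

The explicit singularity check hints at why a reliable covariance estimate matters: with many more voxels than scans the summed covariance matrix cannot be inverted.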

Classification
Support Vector Machine (SVM)
[Figure: scatter plot (voxel 1 vs. voxel 2) with several candidate linear separators]
Which of the linear separators is the optimal one?

Classification
Support Vector Machine (SVM)
[Figure: scatter plot (voxel 1 vs. voxel 2) showing the margin and the support vectors]
• SVM = maximum margin classifier
• If classes have overlapping distributions, SVMs are modified to account for misclassification errors by introducing additional slack variables.

Classification
Linear classifiers
• Fisher Linear Discriminant
• Support Vector Machine (SVM)
• Logistic Regression
• Gaussian Naive Bayes
• …
Nonlinear classifiers
• SVM with kernel
• Neural Networks
• …
How to choose the right classifier?

Classification
Situation 1: few scans, many features (e.g. whole-brain data)
• FLD unsuitable: it depends on a reliable estimation of the covariance matrix
• GNB inferior to SVM and LR: the latter come with regularisation that helps to weigh down the effects of noisy and highly correlated features
Cox & Savoy (2003). NeuroImage

Classification
Situation 2: reduced feature set (e.g. after feature selection or feature extraction)
• GNB, SVM and LR: often similar performance
• SVM originally designed for two-class problems only
• SVM for multiclass problems: multiple binary comparisons, voting scheme to identify classes
• accuracy of SVM increases faster than that of GNB when the number of scans increases
• see Mitchell et al. (2005) for further comparisons between different classifiers

Classification
Peeking #2
• classifier performance = unbiased estimate of classification accuracy
• how well would the classifier label a new example randomly drawn from the same distribution?
• testing a trained classifier needs to be performed on a dataset the classifier has never seen before
• if the entire dataset is used for training a classifier, classification estimates become overly optimistic
Solution: leave-one-out crossvalidation
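The leave-one-out solution can be sketched with scanning runs as folds. To keep the example self-contained it uses a nearest-class-mean classifier as a stand-in for FLD, SVM or any other classifier; the helper names and the toy data are hypothetical.

```python
def leave_one_run_out(run_ids):
    """Yield (held_out_run, train_indices, test_indices) for every run."""
    for held_out in sorted(set(run_ids)):
        train = [i for i, r in enumerate(run_ids) if r != held_out]
        test = [i for i, r in enumerate(run_ids) if r == held_out]
        yield held_out, train, test


def fit_class_means(X, y):
    """Stand-in classifier: store the mean pattern of each class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_nearest_mean(model, x):
    """Assign x to the class whose mean pattern is closest."""
    def sq_dist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(model, key=lambda label: sq_dist(model[label]))


def cross_validated_accuracy(X, y, run_ids):
    """Mean test accuracy over folds; each run is withheld exactly once."""
    fold_accuracies = []
    for _, train, test in leave_one_run_out(run_ids):
        model = fit_class_means([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict_nearest_mean(model, X[i]) == y[i] for i in test)
        fold_accuracies.append(hits / len(test))
    return sum(fold_accuracies) / len(fold_accuracies)


# Toy data: 3 runs, one red and one blue pattern per run, 2 voxels.
X = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.2], [2.1, 1.9], [-0.1, 0.1], [1.9, 2.2]]
y = ["red", "blue", "red", "blue", "red", "blue"]
runs = [0, 0, 1, 1, 2, 2]
acc = cross_validated_accuracy(X, y, runs)
```

Splitting by run rather than by single scan keeps temporally adjacent, highly similar scans out of the test fold.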

Classification
Crossvalidation
• standard approach: leave-one-out crossvalidation
• split the dataset into n folds (e.g. runs)
• train the classifier on n-1 folds (folds 1 to n-1)
• test the trained classifier on fold n
• rerun training/testing while withholding a different fold
• repeat the procedure until each fold has been withheld once
• classification accuracy is usually computed as the mean accuracy across folds
[Figure: training set / test set]

Evaluating results
Can I publish my data with 57% classification accuracy in Science or Nature?
Independent test data
• Classification accuracy = unbiased estimate of the true accuracy of the classifier
• Question: what is the probability of obtaining 57% accuracy under the null hypothesis (no information about the variable of interest in my data)?
• Binary classification: the p-value can be calculated under a binomial distribution with N trials (e.g. 100) and probability of success P (e.g. 0.5)
• Matlab: p = 1 - binocdf(X,N,P) = 0.067 (hmm…)
  X = number of correctly labeled examples (e.g. 57)
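The binomial calculation can be reproduced without a statistics toolbox. The sketch below mirrors the Matlab expression above in plain Python; `binom_cdf` is a hypothetical helper written for this example. Note that 1 - binocdf(X,N,P) is the probability of more than X correct labels; the probability of at least X correct labels uses X-1 in the cumulative sum and is somewhat larger.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(K <= x) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


n_trials = 100      # N: number of test examples
chance = 0.5        # P: success probability under the null hypothesis
n_correct = 57      # X: correctly labelled examples

# Same quantity as the Matlab line: P(K > 57), roughly 0.067.
p_more_than = 1 - binom_cdf(n_correct, n_trials, chance)

# P(K >= 57), i.e. "57 or more correct", roughly 0.097.
p_at_least = 1 - binom_cdf(n_correct - 1, n_trials, chance)
```

Either way, 57 correct out of 100 does not reach the conventional 0.05 threshold.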

Evaluating results
Nonparametric approaches
Permutation tests (e.g. Polyn et al., 2005):
• create a null distribution of performance values by repeatedly generating scrambled versions of the classifier output
• MVPA: wavelet-based scrambling technique (Bullmore et al., 2004) can accommodate non-independent data
Bootstrapping
• estimate the variance and distribution of a statistic (e.g. voxel weights)
• multiple iterations of data resampling by drawing with replacement from the dataset
Multiclass problems: accuracy can be painful
• average rank of the correct label
• average of all pairwise comparisons

Getting results
Design considerations
• acquire as many training examples as possible: the classifier needs to be able to "see through the noise"
• averaging consecutive TRs can help to reduce the impact of noise (but may also eliminate natural, informative variation)
• alternative to averaging: use beta weights from a GLM analysis (based on FIR or HRF); requires many runs / trials
• avoid using consecutive scans for training a classifier: lots of highly similar datapoints do not give new information
• acquire as many test examples as possible: increases the power of the significance test
• balance conditions: if not, the classifier may tend to focus on the predominant condition
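A permutation test of classification accuracy can be sketched as below. It scrambles the classifier output by plain shuffling, which assumes the test examples are exchangeable; it is not the wavelet-based scrambling mentioned above for non-independent data. The function names, the number of permutations and the toy labels are assumptions.

```python
import random


def accuracy(predicted, true):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def permutation_p_value(predicted, true, n_perm=2000, seed=0):
    """Compare the observed accuracy with accuracies of scrambled output."""
    rng = random.Random(seed)
    observed = accuracy(predicted, true)
    scrambled = list(predicted)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(scrambled)
        if accuracy(scrambled, true) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Toy example: 20 balanced test examples, all labelled correctly.
true_labels = ["red", "blue"] * 10
classifier_output = list(true_labels)
observed_acc, p_value = permutation_p_value(classifier_output, true_labels)
```

The smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations limits how small a p-value can be reported.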

Applications
Pattern discrimination
• Question 1: do the selected fMRI data contain information about a variable of interest (e.g. conscious percept in Haynes & Rees)?
Pattern localization
• Question 2: where in the brain is information about the variable of interest represented?
• the weight vector contains information on the importance of each voxel for differentiating between classes
[Figure: scatter plot (voxel 1 vs. voxel 2) of volumes t1 … t25 with hyperplane and weight vector w]

Applications
Pattern localization - Space
[Figure: Polyn et al. (2005), Science.]

Applications
Pattern localization - Space
• Searchlight analysis: classification/crossvalidation is performed on a voxel and its (spherical) neighbourhood
• classification accuracy is assigned to the centre voxel
• the searchlight is moved across the entire dataset to obtain accuracy estimates for each voxel
• can be used for feature selection or to generate a brain map of p-values
[Figure: position classification. Hassabis et al. (2009), Current Biology.]
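The geometry of a searchlight can be sketched independently of the classifier. The example below assumes a regular 3D voxel grid and a radius given in voxels; `score_fn` is a hypothetical callback that would run the classification/crossvalidation on the voxels it receives and return an accuracy.

```python
from math import floor


def sphere_offsets(radius):
    """Integer (dx, dy, dz) offsets lying within radius voxels of the centre."""
    r = floor(radius)
    span = range(-r, r + 1)
    return [(dx, dy, dz) for dx in span for dy in span for dz in span
            if dx * dx + dy * dy + dz * dz <= radius * radius]


def neighbourhood(centre, radius, shape):
    """Voxels of the sphere around centre that fall inside the volume."""
    cx, cy, cz = centre
    voxels = []
    for dx, dy, dz in sphere_offsets(radius):
        x, y, z = cx + dx, cy + dy, cz + dz
        if 0 <= x < shape[0] and 0 <= y < shape[1] and 0 <= z < shape[2]:
            voxels.append((x, y, z))
    return voxels


def searchlight_map(shape, radius, score_fn):
    """Assign score_fn(neighbourhood) to every centre voxel."""
    return {
        (x, y, z): score_fn(neighbourhood((x, y, z), radius, shape))
        for x in range(shape[0])
        for y in range(shape[1])
        for z in range(shape[2])
    }


# Placeholder score: neighbourhood size instead of a real accuracy.
size_map = searchlight_map((3, 3, 3), radius=1, score_fn=len)
```

Spheres near the edge of the volume contain fewer voxels, which is worth keeping in mind when comparing accuracies across positions.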

Applications
Pattern localization - Time
Question 3: when does the brain represent information about different classes?
Motor intention: Soon et al. (2008), Nature Neuroscience.

Applications
Pattern characterization
• Question 4: how are stimulus classes represented in the brain?
• goal: characterizing the relationship between stimulus classes and BOLD patterns
• Kay et al. (2008): training of a receptive field model for each voxel in V1, V2 and V3 based on location, spatial frequency and orientation (1750 natural images); subsequent classification of completely new stimuli (120 natural images)

Useful literature
• Haynes JD, Rees G (2006) Decoding mental states from brain activity in humans. Nat Rev Neurosci 7:523-534.
• Formisano E, De Martino F, Valente G (2008) Multivariate analysis of fMRI time series: classification and regression of brain responses using machine learning. Magn Reson Imaging 26(7):921-934.
• Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain mapping. Proc Natl Acad Sci U S A 103:3863-3868.
• Mitchell TM, et al. (2004) Learning to Decode Cognitive States from Brain Images. Machine Learning 57:145-175.
• Norman KA, Polyn SM, Detre GJ, Haxby JV (2006) Beyond mind-reading: multivoxel pattern analysis of fMRI data. Trends Cogn Sci 10:424-430.
• O'Toole et al. (2007) Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. J Cogn Neurosci 19(11):1735-1752.
• Pereira F, Mitchell TM, Botvinick M (2009) Machine Learning Classifiers and fMRI: a tutorial overview. Neuroimage 45(1 Suppl):S199-209.
