
Multivariate Pattern Classification


SPM 2010 - 13th Course on Functional Imaging (SPM Course Hamburg 2010)

Thomas Wolbers
Space and Ageing Laboratory (www.sal.mvm.ed.ac.uk)
Centre for Cognitive and Neural Systems

Outline

- Why pattern classification?
- Processing stream
- Preprocessing / feature reduction
- Classification
- Evaluating results
- Applications

Why pattern classification?

[Figure: GLM schematic. The data (time in scans) equal the design matrix times the parameters β_0 ... β_10, plus an error term.]

y = X \beta + \varepsilon

GLM: a separate model is fitted for each voxel, i.e. a mass-univariate analysis.


Key idea behind pattern classification

- GLM analysis relies exclusively on the information contained in the time course of individual voxels.
- Multivariate analyses take advantage of the information contained in activity patterns across space, from multiple voxels.
- Cognitive and sensorimotor states are expressed in the brain as distributed patterns of brain activity.

Advantages of multivariate pattern classification

- Increased sensitivity: weak information in single voxels is accumulated across many voxels.
- Multiple regions or voxels may carry information about brain states only when analyzed jointly.
- Can prevent the information loss caused by spatial smoothing (but see Op de Beeck, 2009; Kamitani & Sawahata, 2010).
- Can preserve temporal resolution instead of characterizing the average response across many trials.


Processing stream

Example: binocular rivalry. Can spontaneous changes in conscious experience be decoded from fMRI signals in early visual cortex? (Haynes & Rees, 2005, Current Biology)

1. Acquire fMRI data while the subject is viewing blue and red gratings.
2. Preprocess the fMRI data.


3. Select relevant features (i.e. voxels).
4. Generate fMRI patterns: convert each fMRI volume into a vector that reflects the pattern of activity across voxels at that point in time.
5. Label the fMRI patterns according to whether the subject was perceiving blue or red (adjusting for hemodynamic lag).
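Steps 4 and 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the flat-list representation of a volume, the TR of 2 s and the lag of 4 s are all hypothetical and not taken from the slides.

```python
def volume_to_pattern(volume, selected_voxels):
    """Step 4: turn one fMRI volume (here a flat list of voxel values)
    into a pattern vector containing only the selected voxels."""
    return [volume[i] for i in selected_voxels]


def label_volumes(percepts, tr_seconds=2.0, lag_seconds=4.0):
    """Step 5: assign each volume the percept reported lag_seconds earlier.

    percepts[t] is the percept at scan t. The default TR and lag are
    illustrative assumptions, not values from the slides.
    Volumes without a known earlier percept get the label None.
    """
    shift = round(lag_seconds / tr_seconds)  # lag expressed in scans
    labels = [None] * len(percepts)
    for t, percept in enumerate(percepts):
        if t + shift < len(percepts):
            labels[t + shift] = percept
    return labels


pattern = volume_to_pattern([0.1, 0.5, 0.9, 0.3], selected_voxels=[1, 3])
labels = label_volumes(["blue", "blue", "red", "red"])
```

Volumes labelled None at the start of a run would simply be dropped before training.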


6. Train a classifier to discriminate between blue patterns and red patterns.
7. Apply the trained classifier to new fMRI patterns (not presented during training).
8. Cross-validation.


9. Statistical inference.

Example from Haynes & Rees (2005), Current Biology.


Preprocessing

1. (Slice timing +) realignment (SPM, FSL, ...)
2. High-pass filtering / detrending
   - remove linear (and quadratic) trends (e.g. scanner drift)
   - remove low-frequency artifacts (e.g. biosignals)
3. Z-scoring
   - remove baseline shifts between scanning runs
   - reduce the impact of outliers

Feature reduction

The problem
- fMRI data are typically sparse, high-dimensional and noisy.
- Classification is sensitive to the information content in all voxels.
- Many uninformative voxels lead to poor classification (e.g. due to overfitting).

Solution 1: Feature selection
- select the subset with the most informative features
- original features remain unchanged

[Figure: classification performance as a function of the number of features]

Feature selection

"External" solutions
- Anatomical regions of interest
- Independent functional localizer (Haynes & Rees: retinotopic mapping to identify early visual areas)
- Searchlight classification: define a region of interest (e.g. a sphere) and move it across the search volume (exploratory analysis)

"Internal" univariate solutions
- activation vs. baseline (t-test)
- mean difference between conditions (ANOVA)
- single-voxel classification accuracy
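The detrending and z-scoring steps listed under Preprocessing can be illustrated for a single voxel as follows. This is a minimal sketch that assumes the time course is stored as one list per scanning run; the function names are hypothetical, and only the linear trend is removed (no quadratic term, no high-pass filter).

```python
from statistics import mean, pstdev


def detrend_linear(ts):
    """Remove the least-squares linear trend (and the mean) from one run."""
    t = list(range(len(ts)))
    t_mean, y_mean = mean(t), mean(ts)
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, ts))
             / sum((ti - t_mean) ** 2 for ti in t))
    return [yi - (y_mean + slope * (ti - t_mean)) for ti, yi in zip(t, ts)]


def zscore(ts):
    """Scale one run to zero mean and unit standard deviation."""
    m, s = mean(ts), pstdev(ts)
    if s < 1e-12:  # constant time course: nothing to scale
        return [0.0] * len(ts)
    return [(y - m) / s for y in ts]


def preprocess_voxel(runs):
    """Detrend and z-score each run separately, which removes
    baseline shifts between scanning runs."""
    return [zscore(detrend_linear(run)) for run in runs]


cleaned = preprocess_voxel([[1, 2, 3, 4], [10, 12, 11, 13]])
```

Processing each run on its own is what removes the baseline shift between the two toy runs here.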


[Figure from Pereira et al. (2009)]

Peeking #1 (ANOVA and classification only)
- Testing a trained classifier needs to be performed on independent test datasets.
- If the entire dataset is used for feature selection, classification estimates become overly optimistic.
- Remedy: nested cross-validation!

Feature extraction

Solution 1: Feature selection
- select a subset from all available features
- original features remain unchanged

Solution 2: Feature extraction
- create new features as a function of existing features
- linear functions (PCA, ICA, ...)
- nonlinear functions during classification (e.g. hidden units in a neural network)


Classification

Linear classification

[Figure: fMRI volumes (t1, t2, t3, t4, ..., t254) plotted in the space spanned by voxel 1 and voxel 2, divided into training data and independent test data, with a separating hyperplane.]

- Our task: find a hyperplane that separates both conditions.
- Decision function: y = f(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
- If y < 0, predict red; if y > 0, predict blue.
- The prediction is a linear function of the features.
- Project the data onto a new axis that maximizes the class separability.
- The hyperplane is orthogonal to the best projection axis.
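The decision function above translates directly into code. The sketch below is illustrative only: the function names, the example weights and the offset are made up, and assigning the boundary case y = 0 to red is an arbitrary choice, since the slides only specify y < 0 and y > 0.

```python
def decision_value(x, w, b):
    """y = w_1*x_1 + ... + w_n*x_n + b for one pattern x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Blue for y > 0, red otherwise (y == 0 is assigned to red
    arbitrarily in this sketch)."""
    return "blue" if decision_value(x, w, b) > 0 else "red"


w, b = [0.45, 0.89], -1.0       # illustrative weights and offset
print(predict([1.0, 1.0], w, b))  # 0.45 + 0.89 - 1.0 > 0
print(predict([0.5, 0.5], w, b))  # 0.225 + 0.445 - 1.0 < 0
```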


Simplest approach: Fisher Linear Discriminant (FLD)

FLD classifies by projecting the training set onto the axis that is defined by the difference between the centers of mass of both classes, corrected by the within-class scatter. Separation is maximized for:

w = (cov_{class1} + cov_{class2})^{-1} (m_1 - m_2)

Linear classification

[Figure: volumes t1, t2, t3, t4, ..., t25 in the space spanned by voxel 1 and voxel 2, with the hyperplane and the weight vector w.]

y = w x + b: the hyperplane is defined by the weight vector w and the offset b.

How to interpret the weight vector?

Weight vector (discriminating volume): W = [0.45 0.89]

The value of each voxel in the weight vector indicates its importance in discriminating between the two classes (i.e. cognitive states).
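For two voxels, the FLD weight vector above can be computed by hand. The following is a minimal sketch under assumptions: it is restricted to two features so that the 2x2 matrix inverse can be written out, the function names and toy data are hypothetical, and placing the offset b at the midpoint between the class means is a common convention that the slides do not specify.

```python
def mean_vec(X):
    """Center of mass of a list of 2-D patterns."""
    return [sum(col) / len(X) for col in zip(*X)]


def cov2(X, m):
    """Sample covariance matrix of 2-D patterns X around their mean m."""
    n = len(X)
    sxx = sum((x[0] - m[0]) ** 2 for x in X) / (n - 1)
    syy = sum((x[1] - m[1]) ** 2 for x in X) / (n - 1)
    sxy = sum((x[0] - m[0]) * (x[1] - m[1]) for x in X) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def fld_weights(X1, X2):
    """w = (cov_class1 + cov_class2)^-1 (m1 - m2), two features only."""
    m1, m2 = mean_vec(X1), mean_vec(X2)
    c1, c2 = cov2(X1, m1), cov2(X2, m2)
    a, b = c1[0][0] + c2[0][0], c1[0][1] + c2[0][1]
    c, d = c1[1][0] + c2[1][0], c1[1][1] + c2[1][1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("summed covariance matrix is singular")
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    # inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]]
    w = [(d * dm[0] - b * dm[1]) / det, (-c * dm[0] + a * dm[1]) / det]
    # assumption: decision threshold halfway between the class means
    mid = [(m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2]
    offset = -(w[0] * mid[0] + w[1] * mid[1])
    return w, offset


blue = [[2, 2], [3, 2], [2, 3], [3, 3]]   # class 1 (toy data)
red = [[0, 0], [1, 0], [0, 1], [1, 1]]    # class 2 (toy data)
w, offset = fld_weights(blue, red)
```

With many voxels and few scans the summed covariance matrix becomes singular or poorly estimated, which is why a whole-brain FLD needs regularization or prior feature reduction.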


Support Vector Machine (SVM)

- Which of the linear separators is the optimal one?
- SVM = maximum-margin classifier.
- If the classes have overlapping distributions, SVMs are modified to account for misclassification errors by introducing additional slack variables.

[Figure: two classes in the space spanned by voxel 1 and voxel 2, with the margin and the support vectors marked.]

Linear classifiers
- Fisher Linear Discriminant
- Support Vector Machine (SVM)
- Logistic Regression
- Gaussian Naive Bayes
- ...

Nonlinear classifiers
- SVM with kernel
- Neural networks
- ...

How to choose the right classifier?
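As a brief aside on the slack variables mentioned for the SVM above, the standard soft-margin formulation can be written as follows. The notation is an assumption and does not appear on the slides: class labels y_i in {-1, +1}, slack variables \xi_i, and a penalty parameter C that trades margin width against misclassification.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N
```

With all \xi_i = 0 this reduces to the hard-margin case, in which the margin has width 2 / \lVert w \rVert.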


Situation 1: many more features than scans (i.e. whole-brain data)
- FLD is unsuitable: it depends on reliable estimation of the covariance matrix.
- GNB is inferior to SVM and LR: the latter come with regularization that helps to down-weight the effects of noisy and highly correlated features.
(Cox & Savoy, 2003, NeuroImage)

Situation 2: reduced number of features (i.e. after feature selection or feature extraction)
- GNB, SVM and LR often perform similarly.
- SVM was originally designed for two-class problems only.
- SVM for multiclass problems: multiple binary comparisons, with a voting scheme to identify classes.
- The accuracy of SVM increases faster than that of GNB as the number of scans increases.
- See Mitchell et al. (2004) for further comparisons between different classifiers.

Peeking #2
- Classifier performance = unbiased estimate of classification accuracy: how well would the classifier label a new example randomly drawn from the same distribution?
- Testing a trained classifier needs to be performed on a dataset the classifier has never seen before.
- If the entire dataset is used for training a classifier, classification estimates become overly optimistic.
- Solution: leave-one-out cross-validation.
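The voting scheme mentioned above for multiclass problems is often implemented as one-vs-one voting. The sketch below shows the idea only; `binary_predict` is a hypothetical callable standing in for any trained two-class classifier (an SVM in the slides, a toy nearest-centre rule here), and ties are broken by the order of `classes`.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_predict):
    """Run every pairwise comparison and return the class with most votes.

    binary_predict(x, a, b) is a hypothetical callable that returns
    either a or b for pattern x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_predict(x, a, b)] += 1
    return max(classes, key=lambda c: votes[c])


# Toy stand-in for a trained binary classifier: nearest class centre in 1-D.
centres = {"red": 0.0, "green": 5.0, "blue": 10.0}


def nearest_centre(x, a, b):
    return a if abs(x - centres[a]) <= abs(x - centres[b]) else b


print(one_vs_one_predict(9.0, list(centres), nearest_centre))
```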


Cross-validation

Standard approach: leave-one-out cross-validation
- Split the dataset into n folds (e.g. runs).
- Train the classifier on folds 1 to n-1.
- Test the trained classifier on fold n.
- Rerun training and testing while withholding a different fold.
- Repeat the procedure until each fold has been withheld once.
- Classification accuracy is usually computed as the mean accuracy across folds.

[Figure: dataset split into training set and test set]

Evaluating results

Can I publish my data with 57% classification accuracy in Science or Nature?

Independent test data
- Classification accuracy = unbiased estimate of the true accuracy of the classifier.
- Question: what is the probability of obtaining 57% accuracy under the null hypothesis (no information about the variable of interest in my data)?
- Binary classification: the p-value can be calculated under a binomial distribution with N trials (e.g. 100) and probability of success P (e.g. 0.5).
- Matlab: p = 1 - binocdf(X,N,P) = 0.067 (hmm...), where X = number of correctly labeled examples (e.g. 57).
- Note: 1 - binocdf(X,N,P) is the probability of more than X correct labels. The probability of at least X correct labels is 1 - binocdf(X-1,N,P), which is about 0.097 for these numbers.
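The cross-validation scheme and the binomial test above can be sketched together in plain Python. Several things here are assumptions for illustration: the function names, the toy data, and above all the nearest-mean classifier, which is a deliberately simple stand-in and not one of the classifiers discussed in the slides.

```python
from math import comb


def nearest_mean_fit(patterns, labels):
    """Stand-in classifier: store the mean pattern of each class."""
    means = {}
    for c in set(labels):
        rows = [p for p, lab in zip(patterns, labels) if lab == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def nearest_mean_predict(means, x):
    """Predict the class whose mean pattern is closest to x."""
    def sqdist(m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return min(sorted(means), key=lambda c: sqdist(means[c]))


def leave_one_run_out(patterns, labels, runs):
    """Withhold each run once; return (mean fold accuracy, total correct)."""
    fold_acc, n_correct = [], 0
    for held_out in sorted(set(runs)):
        train = [i for i, r in enumerate(runs) if r != held_out]
        test = [i for i, r in enumerate(runs) if r == held_out]
        model = nearest_mean_fit([patterns[i] for i in train],
                                 [labels[i] for i in train])
        hits = sum(nearest_mean_predict(model, patterns[i]) == labels[i]
                   for i in test)
        fold_acc.append(hits / len(test))
        n_correct += hits
    return sum(fold_acc) / len(fold_acc), n_correct


def binomial_p_value(n_correct, n_trials, chance=0.5):
    """P(at least n_correct correct labels) under the null hypothesis."""
    return sum(comb(n_trials, k) * chance ** k * (1 - chance) ** (n_trials - k)
               for k in range(n_correct, n_trials + 1))


patterns = [[0.0], [1.0], [0.1], [0.9], [0.2], [1.1]]
labels = ["red", "blue"] * 3
runs = [1, 1, 2, 2, 3, 3]
accuracy, n_correct = leave_one_run_out(patterns, labels, runs)

# The Matlab call 1 - binocdf(57, 100, 0.5) from the slides corresponds to
# binomial_p_value(58, 100): strictly more than 57 correct labels.
p_more_than_57 = binomial_p_value(58, 100)
p_at_least_57 = binomial_p_value(57, 100)
```

The binomial test treats the test examples as independent, which is one reason the slides go on to discuss nonparametric alternatives.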


Nonparametric approaches

Permutation tests (e.g. Polyn et al., 2005)
- Create a null distribution of performance values by repeatedly generating scrambled versions of the classifier output.
- MVPA: a wavelet-based scrambling technique (Bullmore et al., 2004) can accommodate non-independent data.

Bootstrapping
- Estimate the variance and distribution of a statistic (e.g. voxel weights).
- Multiple iterations of data resampling by drawing with replacement from the dataset.

Multiclass problems: accuracy can be painful
- average rank of the correct label
- average of all pairwise comparisons

Getting results: design considerations

- Acquire as many training examples as possible: the classifier needs to be able to "see through the noise".
- Averaging consecutive TRs can help to reduce the impact of noise (but may also eliminate natural, informative variation).
- Alternative to averaging: use beta weights from a GLM analysis (e.g. based on FIR or HRF); this requires many runs / trials.
- Avoid using consecutive scans for training a classifier: lots of highly similar data points do not give new information.
- Acquire as many test examples as possible: this increases the power of the significance test.
- Balance conditions: if not, the classifier may tend to focus on the predominant condition.
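The permutation test described above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the classifier output is scrambled by plain random shuffling, which assumes exchangeable examples. The wavelet-based scrambling for non-independent data mentioned on the slides is not implemented here.

```python
import random


def accuracy(predicted, true):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def permutation_p_value(predicted, true, n_permutations=1000, seed=0):
    """Share of scrambled classifier outputs that reach at least the
    observed accuracy (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = accuracy(predicted, true)
    shuffled = list(predicted)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # plain shuffle: assumes exchangeability
        if accuracy(shuffled, true) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)


true = ["red", "blue"] * 10
p_perfect = permutation_p_value(list(true), true)      # informative output
p_constant = permutation_p_value(["red"] * 20, true)   # uninformative output
```

A classifier that always answers "red" scores 50% under every shuffle, so its p-value is exactly 1.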


Applications

Pattern discrimination
- Question 1: do the selected fMRI data contain information about a variable of interest (e.g. the conscious percept in Haynes & Rees)?

Pattern localization
- Question 2: where in the brain is information about the variable of interest represented?
- The weight vector contains information on the importance of each voxel for differentiating between classes.

[Figure: volumes t1, t2, t3, t4, ..., t25 in the space spanned by voxel 1 and voxel 2, with the hyperplane and the weight vector w.]

Pattern localization - Space

[Figure from Polyn et al. (2005), Science]

- Searchlight analysis: classification and cross-validation are performed on a voxel and its (spherical) neighbourhood.
- The classification accuracy is assigned to the centre voxel.
- The searchlight is moved across the entire dataset to obtain accuracy estimates for each voxel.
- Can be used for feature selection or to generate a brain map of p-values.

[Figure: position class., from Hassabis et al. (2009), Current Biology]
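The searchlight procedure above boils down to a loop over centre voxels and their spherical neighbourhoods. The sketch below shows only that bookkeeping. The function names are hypothetical, the radius is measured in voxels, and `score_fn` is a placeholder for the cross-validated classification that would be run on each neighbourhood.

```python
def sphere_offsets(radius):
    """Integer (dx, dy, dz) offsets inside a sphere of the given radius."""
    r = int(radius)
    return [(dx, dy, dz)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            for dz in range(-r, r + 1)
            if dx * dx + dy * dy + dz * dz <= radius * radius]


def searchlight(shape, radius, score_fn):
    """Move the sphere across the volume and store one score per centre.

    score_fn is a hypothetical callable that receives the list of voxel
    coordinates in the neighbourhood and returns e.g. a cross-validated
    accuracy. The score is assigned to the centre voxel.
    """
    nx, ny, nz = shape
    offsets = sphere_offsets(radius)
    result = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                neighbourhood = [
                    (x + dx, y + dy, z + dz)
                    for dx, dy, dz in offsets
                    if 0 <= x + dx < nx and 0 <= y + dy < ny and 0 <= z + dz < nz
                ]
                result[(x, y, z)] = score_fn(neighbourhood)
    return result


# Placeholder score: the number of voxels in each neighbourhood.
sizes = searchlight((3, 3, 3), radius=1, score_fn=len)
```

In practice the loop would usually be restricted to a brain mask, which this sketch leaves out.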


Pattern localization - Time
- Question 3: when does the brain represent information about different classes?
- Example: motor intention (Soon et al., 2008, Nature Neuroscience).

Pattern characterization
- Question 4: how are stimulus classes represented in the brain?
- Goal: characterizing the relationship between stimulus classes and BOLD patterns.
- Kay et al. (2008): training of a receptive field model for each voxel in V1, V2 and V3 based on location, spatial frequency and orientation (1750 natural images); subsequent classification of completely new stimuli (120 natural images).

Useful literature

- Haynes JD, Rees G (2006) Decoding mental states from brain activity in humans. Nat Rev Neurosci 7:523-534.
- Formisano E, De Martino F, Valente G (2008) Multivariate analysis of fMRI time series: classification and regression of brain responses using machine learning. Magn Reson Imaging 26(7):921-934.
- Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain mapping. Proc Natl Acad Sci U S A 103:3863-3868.
- Mitchell TM, et al. (2004) Learning to Decode Cognitive States from Brain Images. Machine Learning 57:145-175.
- Norman KA, Polyn SM, Detre GJ, Haxby JV (2006) Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn Sci 10:424-430.
- O'Toole et al. (2007) Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. J Cogn Neurosci 19(11):1735-1752.
- Pereira F, Mitchell TM, Botvinick M (2009) Machine Learning Classifiers and fMRI: a tutorial overview. Neuroimage 45(1 Suppl):S199-209.
