
GTM: The Generative Topographic Mapping

Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams
C.M.Bishop@aston.ac.uk, svensjfm@aston.ac.uk, C.K.I.Williams@aston.ac.uk

Neural Computing Research Group
Dept of Computer Science & Applied Mathematics
Aston University, Birmingham B4 7ET, United Kingdom
http://www.ncrg.aston.ac.uk/
Tel: +44 (0)121 333 4631   Fax: +44 (0)121 333 4586

Technical Report NCRG/96/015. Accepted for publication in Neural Computation. April 16, 1997.

Abstract

Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this paper we introduce a form of non-linear latent variable model called the Generative Topographic Mapping for which the parameters of the model can be determined using the EM algorithm. GTM provides a principled alternative to the widely used Self-Organizing Map (SOM) of Kohonen (1982), and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multi-phase oil pipeline.


1 Introduction

Many data sets exhibit significant correlations between the variables. One way to capture such structure is to model the distribution of the data in terms of latent, or hidden, variables. A familiar example of this approach is factor analysis, which is based on a linear transformation from latent space to data space. In this paper we show how the latent variable framework can be extended to allow non-linear transformations while remaining computationally tractable. This leads to the GTM (Generative Topographic Mapping) algorithm, which is based on a constrained mixture of Gaussians whose parameters can be optimized using the EM (expectation-maximization) algorithm.

One of the motivations for this work is to provide a principled alternative to the widely used 'self-organizing map' (SOM) algorithm (Kohonen 1982), in which a set of unlabelled data vectors t_n (n = 1, ..., N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. While this algorithm has achieved many successes in practical applications, it also suffers from some significant deficiencies, many of which are highlighted in Kohonen (1995). They include: the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighbourhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm (our goal here is not neuro-biological modelling, but rather the development of effective algorithms for data analysis, for which biological realism need not be considered). We show that the GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages.

An important application of latent variable models is to data visualization. Many of the models used in visualization are regarded as defining a projection from the D-dimensional data space onto a two-dimensional visualization space. We shall see that, by contrast, the GTM model is defined in terms of a mapping from the latent space into the data space. For the purposes of data visualization, the mapping is then inverted using Bayes' theorem, giving rise to a posterior distribution in latent space.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, ..., t_D) in terms of a number L of latent variables x = (x_1, ..., x_L). This is achieved by first considering a function y(x;W) which maps points x in the latent space into corresponding points y(x;W) in the data space. The mapping is governed by a matrix of parameters W, and could consist, for example, of a feed-forward neural network, in which case W would represent the weights and biases. We are interested in the situation in which the dimensionality L of the latent-variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x;W) then maps the latent-variable space into an L-dimensional non-Euclidean manifold S embedded within the data space. This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.

If we define a probability distribution p(x) on the latent-variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly.


Figure 1: The non-linear function y(x;W) defines a manifold S embedded in data space given by the image of the latent-variable space under the mapping x → y. (The schematic shows latent coordinates x_1, x_2 mapped into data coordinates t_1, t_2, t_3.)

Since L < D, the distribution in t-space would be confined to the L-dimensional manifold and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We choose the distribution of t, for given x and W, to be a radially-symmetric Gaussian centred on y(x;W) having variance β⁻¹ so that

$$p(t \mid x; W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left\{-\frac{\beta}{2}\,\lVert y(x;W) - t\rVert^2\right\}. \qquad (1)$$

Note that other models for p(t|x) might also be appropriate, such as a Bernoulli for binary variables (with a sigmoid transformation of y) or a multinomial for mutually exclusive classes (with a 'softmax', or normalized exponential, transformation of y (Bishop 1995)), or even combinations of these. The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

$$p(t \mid W, \beta) = \int p(t \mid x; W, \beta)\, p(x)\, dx. \qquad (2)$$

For a given data set D = (t_1, ..., t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood. In practice it is convenient to maximize the log likelihood, given by

$$\mathcal{L}(W, \beta) = \ln \prod_{n=1}^{N} p(t_n \mid W, \beta). \qquad (3)$$

Once we have specified the prior distribution p(x) and the functional form of the mapping y(x;W), we can in principle determine W and β by maximizing L(W,β). However, the integral over x in (2) will, in general, be analytically intractable. If we choose y(x;W) to be a linear function of W, and we choose p(x) to be Gaussian, then the integral becomes a convolution of two Gaussians which is itself a Gaussian. For a noise distribution p(t|x) which is Gaussian with a diagonal covariance matrix, we obtain the standard factor analysis model. In the case of the radially symmetric Gaussian given by (1) the model is closely related to principal component analysis, since the maximum likelihood solution for W has columns given by the scaled principal eigenvectors. Here we wish to extend this formalism to non-linear functions y(x;W), and in particular to develop a model which is similar in spirit to the SOM algorithm. We therefore consider a specific form for p(x) given by a sum of delta functions centred on the nodes of a regular grid in latent space

$$p(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i) \qquad (4)$$



Figure 2: In order to formulate a latent variable model which is similar in spirit to the SOM, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_i is mapped to a corresponding point y(x_i;W) in data space, and forms the centre of a corresponding Gaussian distribution. (The schematic shows latent coordinates x_1, x_2 mapped into data coordinates t_1, t_2, t_3.)

in which case the integral in (2) can again be performed analytically. Each point x_i is then mapped to a corresponding point y(x_i;W) in data space, which forms the centre of a Gaussian density function, as illustrated in Figure 2. From (2) and (4) we see that the distribution function in data space then takes the form

$$p(t \mid W, \beta) = \frac{1}{K} \sum_{i=1}^{K} p(t \mid x_i; W, \beta) \qquad (5)$$

and the log likelihood function becomes

$$\mathcal{L}(W, \beta) = \sum_{n=1}^{N} \ln\left\{\frac{1}{K} \sum_{i=1}^{K} p(t_n \mid x_i; W, \beta)\right\}. \qquad (6)$$

For the particular noise model p(t|x;W,β) given by (1), the distribution p(t|W,β) corresponds to a constrained Gaussian mixture model (Hinton, Williams, and Revow 1992), since the centres of the Gaussians, given by y(x_i;W), cannot move independently but are related through the function y(x;W). Note that, provided the mapping function y(x;W) is smooth and continuous, the projected points y(x_i;W) will necessarily have a topographic ordering in the sense that any two points x_A and x_B which are close in latent space will map to points y(x_A;W) and y(x_B;W) which are close in data space.

2.1 The EM Algorithm

If we now choose a particular parametrized form for y(x;W) which is a differentiable function of W (for example, a feed-forward network with sigmoidal hidden units) then we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W, and an inverse variance β, which maximize L(W,β). However, our model consists of a mixture distribution which suggests that we might seek an EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin 1977; Bishop 1995). By making a suitable choice of model y(x;W) we will see that the M-step corresponds to the solution of a set of linear equations.


In particular we shall choose y(x;W) to be given by a generalized linear regression model of the form

$$y(x; W) = W \phi(x) \qquad (7)$$

where the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D×M matrix. Generalized linear regression models possess the same universal approximation capabilities as multi-layer adaptive networks, provided the basis functions φ_j(x) are chosen appropriately. The usual limitation of such models, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space (Bishop 1995). In the present context this is not a significant problem, since the dimensionality is governed by the number of latent variables, which will typically be small. In fact for data visualization applications we generally use L = 2.

The maximization of (6) can be regarded as a missing-data problem in which the identity i of the component which generated each data point t_n is unknown. We can formulate the EM algorithm for this model as follows. First, suppose that, at some point in the algorithm, the current weight matrix is given by W_old and the current inverse noise variance is given by β_old. In the E-step we use W_old and β_old to evaluate the posterior probabilities, or responsibilities, of each Gaussian component i for every data point t_n using Bayes' theorem in the form

$$R_{in}(W_{\mathrm{old}}, \beta_{\mathrm{old}}) = p(x_i \mid t_n; W_{\mathrm{old}}, \beta_{\mathrm{old}}) \qquad (8)$$

$$= \frac{p(t_n \mid x_i; W_{\mathrm{old}}, \beta_{\mathrm{old}})}{\sum_{i'=1}^{K} p(t_n \mid x_{i'}; W_{\mathrm{old}}, \beta_{\mathrm{old}})}. \qquad (9)$$
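Continuing the sketch above (numpy imported as np as before), the K×M basis matrix Φ and the E-step (9) can be vectorized as follows; the appended constant column anticipates the bias basis function mentioned in Section 4.1, and the max-subtraction is a standard numerical stabilization of the exponentials, our choice rather than the paper's.

```python
def rbf_phi(latent_pts, centres, width):
    """K x (M+1) matrix with Phi_ij = phi_j(x_i) for Gaussian basis functions,
    plus a constant column acting as a bias term."""
    sq = np.sum((latent_pts[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    phi = np.exp(-sq / (2.0 * width ** 2))
    return np.hstack([phi, np.ones((phi.shape[0], 1))])

def responsibilities(T, Y, beta):
    """E-step (9). T: (N, D) data; Y = Phi @ W.T: (K, D) mixture centres.
    Returns R with shape (K, N)."""
    sq_dist = np.sum((Y[:, None, :] - T[None, :, :]) ** 2, axis=2)  # (K, N)
    log_r = -0.5 * beta * sq_dist
    log_r -= log_r.max(axis=0, keepdims=True)  # stabilize the exponentials
    R = np.exp(log_r)
    return R / R.sum(axis=0, keepdims=True)
```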

We now consider the expectation of the complete-data log likelihood in the form

$$\langle \mathcal{L}_{\mathrm{comp}}(W, \beta) \rangle = \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{\mathrm{old}}, \beta_{\mathrm{old}}) \ln\{p(t_n \mid x_i; W, \beta)\}. \qquad (10)$$

Maximizing (10) with respect to W, and using (1) and (7), we obtain

$$\sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{\mathrm{old}}, \beta_{\mathrm{old}})\,\{W_{\mathrm{new}}\phi(x_i) - t_n\}\,\phi^{\mathrm{T}}(x_i) = 0. \qquad (11)$$

This can conveniently be written in matrix notation in the form

$$\Phi^{\mathrm{T}} G_{\mathrm{old}} \Phi\, W_{\mathrm{new}}^{\mathrm{T}} = \Phi^{\mathrm{T}} R_{\mathrm{old}} T \qquad (12)$$

where Φ is a K×M matrix with elements Φ_ij = φ_j(x_i), T is a N×D matrix with elements t_nk, R is a K×N matrix with elements R_in, and G is a K×K diagonal matrix with elements

$$G_{ii} = \sum_{n=1}^{N} R_{in}(W, \beta). \qquad (13)$$

We can now solve (12) for W_new using standard matrix inversion techniques, based on singular value decomposition to allow for possible ill-conditioning. Note that the matrix Φ is constant throughout the algorithm, and so need only be evaluated once at the start.

Similarly, maximizing (10) with respect to β we obtain the following re-estimation formula

$$\frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in}(W_{\mathrm{old}}, \beta_{\mathrm{old}})\, \lVert W_{\mathrm{new}}\phi(x_i) - t_n \rVert^2. \qquad (14)$$
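A corresponding M-step sketch, solving (12) with an SVD-based least-squares routine (in the spirit of the paper's suggestion to use singular value decomposition against ill-conditioning) and re-estimating β via (14); the function signature is our own.

```python
def m_step(T, R, Phi):
    """M-step: solve (12) for W_new and re-estimate beta via (14).
    T: (N, D) data; R: (K, N) responsibilities; Phi: (K, M) basis matrix."""
    N, D = T.shape
    G = np.diag(R.sum(axis=1))                       # (13): K x K diagonal
    A = Phi.T @ G @ Phi                              # left-hand side of (12)
    B = Phi.T @ R @ T                                # right-hand side of (12)
    W_new = np.linalg.lstsq(A, B, rcond=None)[0].T   # SVD-based solve; (D, M)
    Y = Phi @ W_new.T                                # updated centres y(x_i; W)
    sq_dist = np.sum((Y[:, None, :] - T[None, :, :]) ** 2, axis=2)
    beta_new = N * D / np.sum(R * sq_dist)           # inverse of (14)
    return W_new, beta_new
```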


The EM algorithm alternates between the E-step, corresponding to the evaluation of the posterior probabilities in (9), and the M-step, given by the solution of (12) and (14). Jensen's inequality can be used to show that, at each iteration of the algorithm, the objective function will increase unless it is already at a (local) maximum, as discussed for example in Bishop (1995). Typically the EM algorithm gives satisfactory convergence after a few tens of cycles, particularly since we are primarily interested in convergence of the distribution, and this is often achieved much more rapidly than convergence of the parameters themselves.

If desired, a regularization term can be added to the objective function to control the mapping y(x;W). This can be interpreted as a MAP (maximum a-posteriori) estimator corresponding to a choice of prior over the weights W. In the case of a radially-symmetric Gaussian prior of the form

$$p(W \mid \lambda) = \left(\frac{\lambda}{2\pi}\right)^{MD/2} \exp\left\{-\frac{\lambda}{2}\,\lVert W \rVert^2\right\} \qquad (15)$$

where λ is the regularization coefficient, this leads to a modification of the M-step (12) to give

$$\left(\Phi^{\mathrm{T}} G_{\mathrm{old}} \Phi + \frac{\lambda}{\beta} I\right) W_{\mathrm{new}}^{\mathrm{T}} = \Phi^{\mathrm{T}} R_{\mathrm{old}} T \qquad (16)$$

where I is the identity matrix.
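In code, the weight-decay prior changes the M-step solve by a single added term on the left-hand side; lam is our name for λ, and the rest mirrors the m_step sketch above.

```python
def m_step_regularized(T, R, Phi, beta, lam):
    """M-step with the Gaussian weight prior (15): adds (lam/beta) I, cf. (16)."""
    M = Phi.shape[1]
    G = np.diag(R.sum(axis=1))
    A = Phi.T @ G @ Phi + (lam / beta) * np.eye(M)   # regularized system matrix
    B = Phi.T @ R @ T
    return np.linalg.lstsq(A, B, rcond=None)[0].T
```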



Figure 3: Examples of manifolds generated by sampling from the prior distribution over W given by (15), showing the effect of the choice of basis functions on the smoothness of the manifold. Here the basis functions are Gaussian with width σ = 4s in the left-hand plot (where s is the spacing of the basis function centres), and σ = 2s in the right-hand plot. Different values of λ simply affect the linear scaling of the embedded manifold.

The choice of basis functions is itself part of this prior: in the GTM algorithm, the prior distribution over mapping functions y(x;W) is governed by the prior over weights W, given for example by (15), as well as by the basis functions. We typically choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter, whose value, along with the number and spacing of the basis functions, determines the smoothness of the manifold. Examples of surfaces generated by sampling the prior are shown in Figure 3.

In addition to the basis functions φ_i(x), it is also necessary to select the latent-space sample points {x_i}. Note that, if there are too few sample points in relation to the number of basis functions, then the Gaussian mixture centres in data space become relatively independent and the desired smoothness properties can be lost. Having a large number of sample points, however, causes no difficulty beyond increased computational cost. In particular, there is no 'over-fitting' if the number of sample points is increased, since the number of degrees of freedom in the model is controlled by the mapping function y(x;W). One way to view the role of the latent space samples {x_i} is as a Monte Carlo approximation to the integral over x in (2) (MacKay 1995; Bishop, Svensén, and Williams 1996a). The choice of the number K and location of the sample points x_i in latent space is not critical, and we typically choose Gaussian basis functions and set K so that, in the case of a two-dimensional latent space, O(100) sample points lie within 2σ of the centre of each basis function.

Note that we have considered the basis function parameters (widths and locations) to be fixed, with a Gaussian prior on the weight matrix W. In principle, priors over the basis function parameters could also be introduced, and these could again be treated by MAP estimation or by Bayesian integration.

We initialize the parameters W so that the GTM model initially approximates principal component analysis. To do this, we first evaluate the data covariance matrix and obtain the first and second principal eigenvectors, and then we determine W by minimizing the error function

$$E = \frac{1}{2} \sum_{i} \lVert W\phi(x_i) - U x_i \rVert^2 \qquad (20)$$

where the columns of U are given by the eigenvectors.


This represents the sum-of-squares error between the projections of the latent points into data space by the GTM model and the corresponding projections obtained from PCA. The value of β⁻¹ is initialized to be the larger of either the (L+1)th eigenvalue from PCA (representing the variance of the data away from the PCA plane) or the square of half of the grid spacing of the PCA-projected latent points in data space.

Finally, we note that in a numerical implementation care must be taken over the evaluation of the responsibilities, since this involves computing the exponentials of the distances between the projected latent points and the data points, which may span a significant range of values.

2.4 Summary of the GTM Algorithm

Although the foregoing discussion has been somewhat detailed, the underlying GTM algorithm itself is straightforward and is summarized here for convenience.

GTM consists of a constrained mixture of Gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points {x_i} in latent space, together with a set of basis functions {φ_j(x)}. The adaptive parameters W and β define a constrained mixture of Gaussians with centres Wφ(x_i) and a common covariance matrix given by β⁻¹I. After initializing W and β, training involves alternating between the E-step, in which the posterior probabilities are evaluated using (9), and the M-step, in which W and β are re-estimated using (12) and (14) respectively. Evaluation of the log likelihood using (6) at the end of each cycle can be used to monitor convergence. A complete training-loop sketch, tying together the helpers developed above, follows.
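This sketch assembles the earlier helpers (latent_grid, rbf_phi, responsibilities, m_step) into a minimal end-to-end fit, including a PCA-based initialization of W and β along the lines described above. The max-subtraction inside responsibilities already provides the stable evaluation of the exponentials just noted. Default sizes and some initialization details (mean-centring, eigenvector scaling) are our own pragmatic choices, not values prescribed by the paper.

```python
def fit_gtm(T, grid_shape=(10, 10), basis_shape=(4, 4), n_iter=30):
    """Minimal GTM training loop: PCA initialization, then EM cycles."""
    N, D = T.shape
    X = latent_grid(grid_shape)                  # K x L latent sample points
    C = latent_grid(basis_shape)                 # M x L basis-function centres
    width = 2.0 * (C[1, -1] - C[0, -1])          # twice the centre separation
    Phi = rbf_phi(X, C, width)                   # K x (M+1), incl. bias column

    # PCA initialization: choose W so that W phi(x_i) ~ U x_i, as in (20).
    Tc = T - T.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Tc.T))
    order = np.argsort(evals)[::-1]
    L = X.shape[1]
    U = evecs[:, order[:L]] * np.sqrt(evals[order[:L]])  # scaled (our choice)
    targets = X @ U.T + T.mean(axis=0)           # shift to data centre (ours)
    W = np.linalg.lstsq(Phi, targets, rcond=None)[0].T

    # beta^{-1}: larger of (L+1)th eigenvalue and squared half grid spacing.
    Y = Phi @ W.T
    half_spacing = np.linalg.norm(Y[1] - Y[0]) / 2.0
    beta = 1.0 / max(evals[order[L]], half_spacing ** 2)

    for _ in range(n_iter):
        R = responsibilities(T, Phi @ W.T, beta)  # E-step (9)
        W, beta = m_step(T, R, Phi)               # M-step (12), (14)
    return W, beta, Phi, X
```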

3 Experimental Results

We now present results from the application of this algorithm, first to a toy problem involving data in two dimensions, and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines. In both examples we choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter chosen equal to twice the separation of neighbouring basis function centres. Results from a toy problem for the case of a 2-dimensional data space and a 1-dimensional latent space are shown in Figure 4.
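As a hypothetical usage example in the spirit of this toy setting (the particular curve, noise level, and seed are our own inventions), the fit_gtm sketch above can be exercised directly:

```python
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=400)
# Noisy samples around a 1-dimensional curve embedded in 2 dimensions.
T_toy = np.stack([s, np.sin(2.0 * s)], axis=1) + 0.05 * rng.normal(size=(400, 2))
W, beta, Phi, X = fit_gtm(T_toy, grid_shape=(20,), basis_shape=(5,), n_iter=15)
print("converged inverse variance beta:", beta)
```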

3.1 Oil Flow Data

Our second example arises from the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas (Bishop and James 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data is used which models accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to laminar, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the 3 configurations. We take the latent-variable space to be two-dimensional, since our goal is data visualization.

Figure 5 shows the oil data visualized in the latent-variable space in which, for each data point, we have plotted the posterior mean vector. Each point has then been labelled according to its multi-phase configuration. For comparison, Figure 5 also shows the corresponding results obtained using principal component analysis.



Figure 4: Results from a toy problem involving data generated from a 1-dimensional curve embedded in 2 dimensions, together with the projected latent points ('+') and their Gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and the converged configuration, obtained after 15 iterations of EM, is shown on the right.

Figure 5: The left plot shows the posterior-mean projection of the oil flow data in the latent space of the GTM model, while the plot on the right shows the same data set visualized using principal component analysis. In both plots, crosses, circles and plus-signs represent stratified, annular and homogeneous multi-phase configurations respectively. Note how the non-linearity of GTM gives an improved separation of the clusters.


4 Relation to the Self-Organizing Map

Since one motivation for GTM is to provide a principled alternative to the self-organizing map, it is useful to consider the precise relationship between GTM and SOM. We focus our attention on the batch versions of both algorithms, as this helps to make the relationship particularly clear.

The batch version of the SOM algorithm (Kohonen 1995) can be described as follows. A set of K reference vectors z_i is defined in the data space, in which each vector is associated with a node on a regular lattice in a (typically) two-dimensional 'feature map' (analogous to the latent space of GTM). The algorithm begins by initializing the reference vectors (for example by setting them to random values, by setting them equal to a random subset of the data points, or by using principal component analysis). Each cycle of the algorithm then proceeds as follows. For every data vector t_n the corresponding 'winning node' j(n) is identified, corresponding to the reference vector z_j having the smallest Euclidean distance ‖z_j − t_n‖² to t_n. The reference vectors are then updated by setting them equal to weighted averages of the data points, given by

$$z_i = \frac{\sum_n h_{ij(n)}\, t_n}{\sum_n h_{ij(n)}} \qquad (21)$$

in which h_ij is a neighbourhood function associated with the ith node. This is generally chosen to be a uni-modal function of the feature map coordinates centred on the winning node, for example a Gaussian. The steps of identifying the winning nodes and updating the reference vectors are repeated iteratively. A key ingredient in the algorithm is that the width of the neighbourhood function h_ij starts with a relatively large value and is gradually reduced after each iteration.
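For comparison with the GTM sketches above, a minimal batch-SOM cycle implementing the update (21) might look as follows; the Gaussian neighbourhood and the shrinking schedule shown in the comment are the usual hand-crafted choices, with illustrative values of our own.

```python
import numpy as np

def batch_som_cycle(T, Z, nodes, width):
    """One cycle of batch SOM: find winners, then apply the update (21).
    T: (N, D) data; Z: (K, D) reference vectors; nodes: (K, L) lattice coords."""
    d = np.sum((Z[:, None, :] - T[None, :, :]) ** 2, axis=2)   # (K, N)
    win = np.argmin(d, axis=0)                                 # winning node j(n)
    # Neighbourhood h_{i j(n)}: Gaussian in feature-map coordinates.
    h = np.exp(-np.sum((nodes[:, None, :] - nodes[None, win, :]) ** 2, axis=2)
               / (2.0 * width ** 2))                           # (K, N)
    return (h @ T) / h.sum(axis=1, keepdims=True)              # update (21)

# Hand-crafted shrinking of the neighbourhood width over the cycles:
# for cycle in range(25):
#     Z = batch_som_cycle(T, Z, nodes, width=2.0 * 0.9 ** cycle)
```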

4.1 Kernel versus Linear Regression

As pointed out by Mulier and Cherkassky (1995), the value of the neighbourhood function h_{ij(n)} depends only on the identity of the winning node j and not on the value of the corresponding data vector t_n. We can therefore perform partial sums over the groups G_j of data vectors assigned to each node j, and hence re-write (21) in the form

$$z_i = \sum_j K_{ij}\, m_j \qquad (22)$$

in which m_j is the mean of the vectors in group G_j and is given by

$$m_j = \frac{1}{N_j} \sum_{n \in G_j} t_n \qquad (23)$$

where N_j is the number of data vectors in group G_j. The result (22) is analogous to the Nadaraya-Watson kernel regression formula (Nadaraya 1964; Watson 1964) with the kernel functions given by

$$K_{ij} = \frac{h_{ij} N_j}{\sum_{j'} h_{ij'} N_{j'}}. \qquad (24)$$

Thus the batch SOM algorithm replaces the reference vectors at each cycle with a convex combination of the node means m_j, with coefficients determined by the neighbourhood function. Note that the kernel coefficients satisfy Σ_j K_ij = 1 for every i.
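The kernel-smoother form (22)-(24) can be checked numerically: the sketch below (names ours) recomputes the batch-SOM update from the node means and verifies that the kernel coefficients in each row sum to one.

```python
def som_as_kernel_smoother(T, Z, nodes, width):
    """Batch SOM update written as z_i = sum_j K_ij m_j, per (22)-(24)."""
    d = np.sum((Z[:, None, :] - T[None, :, :]) ** 2, axis=2)
    win = np.argmin(d, axis=0)                       # group assignments G_j
    K = Z.shape[0]
    N_j = np.bincount(win, minlength=K)              # group sizes
    m = np.zeros_like(Z)
    np.add.at(m, win, T)                             # group sums
    m[N_j > 0] /= N_j[N_j > 0, None]                 # node means (23)
    h = np.exp(-np.sum((nodes[:, None, :] - nodes[None, :, :]) ** 2, axis=2)
               / (2.0 * width ** 2))                 # h_ij on the lattice
    Kmat = h * N_j[None, :]
    Kmat /= Kmat.sum(axis=1, keepdims=True)          # kernels (24)
    assert np.allclose(Kmat.sum(axis=1), 1.0)        # rows are convex weights
    return Kmat @ m                                  # combination (22)
```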



Figure 6: Example of the effective kernel F_ij plotted as a function of the node j for a given node i, for the oil flow data set after 3 iterations of EM. This kernel function is analogous to the (normalized) neighbourhood function in the SOM algorithm.

In the GTM algorithm, the centres y(x_i;W) of the Gaussian components can be regarded as analogous to the reference vectors z_i of the SOM. We can evaluate y(x_i;W) by solving the M-step equation (12) to find W and then using y(x_i;W) = Wφ(x_i). If we define the weighted means of the data vectors by

$$\mu_i = \frac{\sum_n R_{in}\, t_n}{\sum_n R_{in}} \qquad (25)$$

then we obtain

$$y(x_i; W) = \sum_j F_{ij}\, \mu_j \qquad (26)$$

where we have introduced the effective kernel F_ij given by

$$F_{ij} = \phi^{\mathrm{T}}(x_i)\, \left(\Phi^{\mathrm{T}} G \Phi\right)^{-1} \phi(x_j)\, G_{jj}. \qquad (27)$$

Note that the effective kernel satisfies Σ_j F_ij = 1. To see this, we first use (27) to show that Σ_j F_ij φ_l(x_j) = φ_l(x_i). Then if one of the basis functions φ_l corresponds to a bias, so that φ_l(x) = const., the result follows.

The solution for y(x_i;W) given by (26) and (27) can be interpreted as a weighted least-squares regression (Mardia, Kent, and Bibby 1979) in which the 'target' vectors are the μ_i, and the weighting coefficients are given by G_jj.

Figure 6 shows an example of the effective kernel for GTM corresponding to the oil flow problem discussed in Section 3.

From (22) and (26) we see that both GTM and SOM can be regarded as forms of kernel smoothers. However, there are two key differences. The first is that in SOM the vectors which are smoothed, defined by (23), correspond to hard assignments of data points to nodes, whereas the corresponding vectors in GTM, given by (25), involve soft assignments, weighted by the posterior probabilities. This is analogous to the distinction between K-means clustering (hard assignments) and fitting a standard Gaussian mixture model using EM (soft assignments).

The second key difference is that the kernel function in SOM is made to shrink during the course of the algorithm in an arbitrary, hand-crafted manner. In GTM the posterior probability distribution in latent space, for a given data point, forms a localised 'bubble', and the radius of this bubble shrinks automatically during training, as shown in Figure 7. This responsibility bubble governs the extent to which individual data points contribute towards the vectors μ_i in (25) and hence towards the updating of the Gaussian centres y(x_i;W) via (26).
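The effective kernel (27) and its normalization property can likewise be verified numerically; in this sketch (names ours), Phi and R are assumed to come from a model fitted with the earlier helpers.

```python
def effective_kernel(Phi, R):
    """F_ij = phi(x_i)^T (Phi^T G Phi)^{-1} phi(x_j) G_jj, as in (27)."""
    G = np.diag(R.sum(axis=1))
    A_inv = np.linalg.pinv(Phi.T @ G @ Phi)   # pseudo-inverse for safety
    return Phi @ A_inv @ Phi.T @ G

# With a bias column in Phi, each row of F sums to one:
# F = effective_kernel(Phi, R); assert np.allclose(F.sum(axis=1), 1.0)
```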



Figure 7: Examples of the posterior probabilities (responsibilities) R_in of the latent space points at an early stage (left), intermediate stage (centre) and late stage (right) during the convergence of the GTM algorithm. These have been evaluated for a single data point from the training set in the oil-flow problem discussed in Section 3, and are plotted using a non-linear scaling of the form p(x|t_n)^0.1 to highlight the variation over the latent space. Notice how the responsibility 'bubble', which governs the updating of the weight matrix, and hence the updating of the data-space vectors y(x_i;W), shrinks automatically during the learning process.

4.2 Comparison of GTM with SOM

The most significant difference between the GTM and SOM algorithms is that GTM defines an explicit probability density given by the mixture distribution in (5). As a consequence there is a well-defined objective function given by the log likelihood (6), and convergence to a (local) maximum of the objective function is guaranteed by the use of the EM algorithm (Dempster, Laird, and Rubin 1977). This also provides a direct means to compare different choices of model parameters, and even to compare a GTM solution with another density model, by evaluating the likelihood of a test set under the generative distributions of the respective models. For the SOM algorithm, however, there is no probability density and no well-defined objective function which is being minimized by the training process. Indeed it has been proven (Erwin, Obermayer, and Schulten 1992) that such an objective function cannot exist for the SOM.

A further limitation of the SOM, highlighted in Kohonen (1995, page 234), is that the conditions under which so-called 'self-organization' of the SOM occurs have not been quantified, and so in practice it is necessary to confirm empirically that the trained model does indeed have the desired spatial ordering. In contrast, the neighbourhood-preserving nature of the GTM mapping is an automatic consequence of the choice of a continuous function y(x;W).

Similarly, the smoothness properties of the SOM are determined indirectly by the choice of neighbourhood function and by the way in which it is changed during the course of the algorithm, and are therefore difficult to control. Thus, prior knowledge about the form of the map cannot easily be specified. The prior distribution for GTM, however, can be controlled directly, and properties such as smoothness are governed explicitly by basis function parameters, as illustrated in Figure 3.

Finally, we consider the relative computational costs of the GTM and SOM algorithms. For problems involving data in high-dimensional spaces the dominant computational cost of GTM arises from the evaluation of the Euclidean distances from every data point to every Gaussian centre y(x_i;W). Since exactly the same calculations must be done for SOM (involving the distances of data points from the reference vectors z_i) we expect one iteration of either algorithm to take approximately the same time. An empirical comparison of the computational cost of GTM and SOM was obtained by running each algorithm on the oil flow data until 'convergence' (defined as no discernible change in the appearance of the visualization map).


The GTM algorithm took 1058 sec. (40 iterations), while the batch SOM took 1011 sec. (25 iterations) using a Gaussian neighbourhood function. With a simple 'top-hat' neighbourhood function, in which each reference vector is updated at each iteration using only data points associated with nearby reference vectors, the CPU time for the SOM algorithm is reduced to 305 sec. (25 iterations). One potential advantage of GTM in practical applications arises from a reduction in the number of experimental training runs needed, since both convergence and topographic ordering are guaranteed.

5 Relation to Other Algorithms

There are several algorithms in the published literature which have close links with GTM. Here we review briefly the most significant of these.

The elastic net algorithm of Durbin and Willshaw (1987) can be viewed as a Gaussian mixture density model, fitted by penalized maximum likelihood. The penalty term encourages the centres of Gaussians corresponding to neighbouring points along the (typically one-dimensional) chain to be close in data space. It differs from GTM in that it does not define a continuous data space manifold. Also, the training algorithm generally involves a hand-crafted annealing of the weight penalty coefficient.

There are also similarities between GTM and principal curves and principal surfaces (Hastie and Stuetzle 1989; LeBlanc and Tibshirani 1994), which again involve a two-stage algorithm consisting of projection followed by smoothing, although these are not generative models. It is interesting to note that Hastie and Stuetzle (1989) propose reducing the spatial width of the smoothing function during learning, in a manner analogous to the shrinking of the neighbourhood function in the SOM. A modified form of the principal curves algorithm (Tibshirani 1992) introduces a generative distribution based on a mixture of Gaussians, with a well-defined likelihood function, and is trained by the EM algorithm. However, the number of Gaussian components is equal to the number of data points, and smoothing is imposed by penalizing the likelihood function with the addition of a derivative-based regularization term.

The technique of parametrized self-organizing maps (PSOMs) involves first fitting a standard SOM model to a data set and then finding a manifold in data space which interpolates the reference vectors (Ritter 1993). Although this defines a continuous manifold, the interpolating surface does not form part of the training algorithm, and the basic problems in using SOM, discussed in Section 4.2, remain.

The SOM has also been used for vector quantization. In this context it has been shown how a re-formulation of the vector quantization problem (Luttrell 1990; Buhmann and Kühnel 1993; Luttrell 1994) can avoid many of the problems with the SOM procedure discussed earlier.

Finally, the 'density network' model of MacKay (1995) involves transforming a simple distribution in latent space to a complex distribution in data space by propagation through a non-linear network. A discrete distribution in latent space is again used, which is interpreted as an approximate Monte Carlo integration over the latent variables needed to define the data space distribution. GTM can be seen as a particular instance of this framework in which the sampling of latent space is regular rather than stochastic, a specific form of non-linearity is used, and the model parameters are adapted using EM.


6 Discussion

In this paper we have introduced a form of non-linear latent variable model which can be trained efficiently using the EM algorithm. Viewed as a topographic mapping algorithm, it has the key property that it defines a probability density model.

As an example of the significance of having a probability density, consider the important practical problem of dealing with missing values in the data set (in which some components of the data vectors t_n are unobserved). If the missing values are missing at random (Little and Rubin 1987), then the likelihood function is obtained by integrating out the unobserved values. For the GTM model the integrations can be performed analytically, leading to a simple modification of the EM algorithm.

A further consequence of having a probabilistic approach is that it is straightforward to consider a mixture of GTM models. In this case the overall density can be written as

$$p(t) = \sum_r P(r)\, p(t \mid r) \qquad (28)$$

where p(t|r) represents the rth model, with its own set of independent parameters, and the P(r) are mixing coefficients satisfying 0 ≤ P(r) ≤ 1 and Σ_r P(r) = 1. Again, it is straightforward to extend the EM algorithm to maximize the corresponding likelihood function.

The GTM algorithm can be extended in other ways, for instance by allowing independent mixing coefficients π_i (prior probabilities) for each of the Gaussian components, which again can be estimated by a straightforward extension of the EM algorithm. Instead of being independent parameters, the π_i can be determined as smooth functions of the latent variables using a normalized exponential applied to a generalized linear regression model, although in this case the M-step of the EM algorithm would involve non-linear optimization. Similarly, the inverse noise variance β can be generalized to a function of x. An important property of GTM is the existence of a smooth manifold in data space, which allows the local 'magnification factor' between latent and data space to be evaluated as a function of the latent space coordinates using the techniques of differential geometry (Bishop, Svensén, and Williams 1996b). Finally, since there is a well-defined likelihood function, it is straightforward in principle to introduce priors over the model parameters (as discussed in Section 2.1) and to use Bayesian techniques in place of maximum likelihood.

Throughout this paper we have focussed on the batch version of the GTM algorithm, in which all of the training data are used together to update the model parameters. In some applications it will be more convenient to consider sequential adaptation in which data points are presented one at a time. Since we are maximizing a differentiable objective function, given by (6), a sequential algorithm can be obtained by appealing to the Robbins-Monro procedure (Robbins and Monro 1951; Bishop 1995) to find a zero of the objective function gradient. Alternatively, a sequential form of the EM algorithm can be used (Titterington, Smith, and Makov 1985).

A website for GTM is provided at http://www.ncrg.aston.ac.uk/GTM/ which includes postscript files of relevant papers, a software implementation in Matlab (a C implementation is under development), and example data sets used in the development of the GTM algorithm.


Acknowledgements

This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualization of High-Dimensional Data. We would like to thank Geoffrey Hinton, Iain Strachan and Michael Tipping for useful discussions. Markus Svensén would like to thank the staff of the SANS group in Stockholm for their hospitality during part of this project.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A 327, 580-593.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996a). A fast EM algorithm for latent variable density models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 465-471. MIT Press.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996b). Magnification factors for the GTM algorithm. To appear in Proceedings Fifth IEE International Conference on Artificial Neural Networks.

Buhmann, J. and K. Kühnel (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory 39(4), 1133-1145.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39(1), 1-38.

Durbin, R. and D. Willshaw (1987). An analogue approach to the travelling salesman problem. Nature 326, 689-691.

Erwin, E., K. Obermayer, and K. Schulten (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics 67, 47-55.

Hastie, T. and W. Stuetzle (1989). Principal curves. Journal of the American Statistical Association 84(406), 502-516.

Hinton, G. E., C. K. I. Williams, and M. D. Revow (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems, Volume 4, pp. 512-519. Morgan Kaufmann.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59-69.

Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.

LeBlanc, M. and R. Tibshirani (1994). Adaptive principal surfaces. Journal of the American Statistical Association 89(425), 53-64.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Transactions on Neural Networks 1(2), 229-232.

Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation 6(5), 767-794.

MacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A 354(1), 73-80.

Mardia, K., J. Kent, and M. Bibby (1979). Multivariate Analysis. Academic Press.

Mulier, F. and V. Cherkassky (1995). Self-organization as an iterative kernel smoothing process. Neural Computation 7(6), 1165-1177.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications 9(1), 141-142.

Ritter, H. (1993). Parametrized self-organizing maps. In Proceedings ICANN'93 International Conference on Artificial Neural Networks, Amsterdam, pp. 568-575. Springer-Verlag.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400-407.

Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing 2, 183-190.

Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. New York: John Wiley.

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26, 359-372.
