GTM: The Generative Topographic Mapping

Christopher M. Bishop, Markus Svensén, Christopher K. I. Williams
Neural Computing Research Group, Dept of Computer Science & Applied Mathematics
Aston University, Birmingham B4 7ET, United Kingdom
http://www.ncrg.aston.ac.uk/   Tel: +44 (0)121 333 4631   Fax: +44 (0)121 333 4586
C.M.Bishop@aston.ac.uk, svensjfm@aston.ac.uk, C.K.I.Williams@aston.ac.uk

Technical Report NCRG/96/015. Accepted for publication in Neural Computation. April 16, 1997

Abstract

Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this paper we introduce a form of non-linear latent variable model called the Generative Topographic Mapping for which the parameters of the model can be determined using the EM algorithm. GTM provides a principled alternative to the widely used Self-Organizing Map (SOM) of Kohonen (1982), and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multi-phase oil pipeline.
1 Introduction
Many data sets exhibit significant correlations between the variables. One way to capture such structure is to model the distribution of the data in terms of latent, or hidden, variables. A familiar example of this approach is factor analysis, which is based on a linear transformation from latent space to data space. In this paper we show how the latent variable framework can be extended to allow non-linear transformations while remaining computationally tractable. This leads to the GTM (Generative Topographic Mapping) algorithm, which is based on a constrained mixture of Gaussians whose parameters can be optimized using the EM (expectation-maximization) algorithm.

One of the motivations for this work is to provide a principled alternative to the widely used `self-organizing map' (SOM) algorithm (Kohonen 1982) in which a set of unlabelled data vectors t_n (n = 1, …, N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. While this algorithm has achieved many successes in practical applications, it also suffers from some significant deficiencies, many of which are highlighted in Kohonen (1995). They include: the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighbourhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm¹. We show that the GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages.

An important application of latent variable models is to data visualization. Many of the models used in visualization are regarded as defining a projection from the D-dimensional data space onto a two-dimensional visualization space. We shall see that, by contrast, the GTM model is defined in terms of a mapping from the latent space into the data space. For the purposes of data visualization, the mapping is then inverted using Bayes' theorem, giving rise to a posterior distribution in latent space.

¹ The goal here is not neuro-biological modelling, but rather the development of effective algorithms for data analysis, for which biological realism need not be considered.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, …, t_D) in terms of a number L of latent variables x = (x_1, …, x_L). This is achieved by first considering a function y(x; W) which maps points x in the latent space into corresponding points y(x; W) in the data space. The mapping is governed by a matrix of parameters W, and could consist, for example, of a feed-forward neural network in which case W would represent the weights and biases. We are interested in the situation in which the dimensionality L of the latent-variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the latent-variable space into an L-dimensional non-Euclidean manifold S embedded within the data space. This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.

If we define a probability distribution p(x) on the latent-variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly.
Figure 1: The non-linear function y(x; W) defines a manifold S embedded in data space given by the image of the latent-variable space under the mapping x → y.

Since L < D, the distribution in t-space would be confined to the L-dimensional manifold and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We choose the distribution of t, for given x and W, to be a radially-symmetric Gaussian centred on y(x; W) having variance β⁻¹ so that
p(t | x; W, β) = (β/2π)^{D/2} exp{ −(β/2) ‖y(x; W) − t‖² }    (1)

Note that other models for p(t|x) might also be appropriate, such as a Bernoulli for binary variables (with a sigmoid transformation of y) or a multinomial for mutually exclusive classes (with a `softmax', or normalized exponential, transformation of y (Bishop 1995)), or even combinations of these. The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

p(t | W, β) = ∫ p(t | x; W, β) p(x) dx    (2)

For a given data set D = (t_1, …, t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood. In practice it is convenient to maximize the log likelihood, given by

L(W, β) = ln ∏_{n=1}^{N} p(t_n | W, β)    (3)

Once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), we can in principle determine W and β by maximizing L(W, β). However, the integral over x in (2) will, in general, be analytically intractable. If we choose y(x; W) to be a linear function of W, and we choose p(x) to be Gaussian, then the integral becomes a convolution of two Gaussians which is itself a Gaussian. For a noise distribution p(t|x) which is Gaussian with a diagonal covariance matrix, we obtain the standard factor analysis model. In the case of the radially symmetric Gaussian given by (1) the model is closely related to principal component analysis since the maximum likelihood solution for W has columns given by the scaled principal eigenvectors. Here we wish to extend this formalism to non-linear functions y(x; W), and in particular to develop a model which is similar in spirit to the SOM algorithm. We therefore consider a specific form for p(x) given by a sum of delta functions centred on the nodes of a regular grid in latent space

p(x) = (1/K) Σ_{i=1}^{K} δ(x − x_i)    (4)
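The generative model defined by (1), (2) and (4) can be sketched numerically. The following is a minimal illustration, not taken from the paper: the latent grid, the smooth mapping y (a hand-picked function standing in for y(x; W)), the inverse variance and all sizes are made-up choices; the density is simply the resulting average of K Gaussians.

```python
import numpy as np

# Minimal sketch of the generative model in (1), (2) and (4): a uniform grid
# of K delta functions in a 1-d latent space, a hand-picked smooth mapping
# y(x) into D = 2 dimensions (standing in for y(x; W)), and radially
# symmetric Gaussian noise with inverse variance beta.

rng = np.random.default_rng(0)
K, D, beta = 20, 2, 100.0
latent_grid = np.linspace(-1.0, 1.0, K)            # grid nodes x_i

def y(x):
    """Smooth map from latent space into data space (illustrative choice)."""
    return np.stack([x, np.sin(np.pi * x)], axis=-1)

def sample(n):
    """Draw n points t ~ p(t): pick a node uniformly, then add noise."""
    i = rng.integers(K, size=n)
    return y(latent_grid[i]) + rng.normal(scale=beta ** -0.5, size=(n, D))

def density(t):
    """p(t | W, beta) from (2) with the prior (4): an average of K Gaussians."""
    centres = y(latent_grid)                       # (K, D)
    sq = ((t[None, :] - centres) ** 2).sum(axis=1)
    return ((beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq)).mean()

data = sample(500)
p_on = density(y(np.array(0.0)))                   # a point on the manifold
p_off = density(np.array([0.0, 3.0]))              # a point far from it
```

Points on (or near) the image of the latent grid receive much higher density than points away from the manifold, which is precisely the singularity-plus-noise picture described above.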
Figure 2: In order to formulate a latent variable model which is similar in spirit to the SOM, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_i is mapped to a corresponding point y(x_i; W) in data space, and forms the centre of a corresponding Gaussian distribution.

in which case the integral in (2) can again be performed analytically. Each point x_i is then mapped to a corresponding point y(x_i; W) in data space, which forms the centre of a Gaussian density function, as illustrated in Figure 2. From (2) and (4) we see that the distribution function in data space then takes the form

p(t | W, β) = (1/K) Σ_{i=1}^{K} p(t | x_i; W, β)    (5)

and the log likelihood function becomes

L(W, β) = Σ_{n=1}^{N} ln{ (1/K) Σ_{i=1}^{K} p(t_n | x_i; W, β) }    (6)

For the particular noise model p(t | x; W, β) given by (1), the distribution p(t | W, β) corresponds to a constrained Gaussian mixture model (Hinton, Williams, and Revow 1992) since the centres of the Gaussians, given by y(x_i; W), cannot move independently but are related through the function y(x; W). Note that, provided the mapping function y(x; W) is smooth and continuous, the projected points y(x_i; W) will necessarily have a topographic ordering in the sense that any two points x_A and x_B which are close in latent space will map to points y(x_A; W) and y(x_B; W) which are close in data space.

2.1 The EM Algorithm

If we now choose a particular parametrized form for y(x; W) which is a differentiable function of W (for example, a feed-forward network with sigmoidal hidden units) then we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W, and an inverse variance β, which maximize L(W, β). However, our model consists of a mixture distribution which suggests that we might seek an EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin 1977; Bishop 1995). By making a suitable choice of model y(x; W) we will see that the M-step corresponds to the solution of a set of linear equations. In particular we shall choose y(x; W) to be given by a generalized linear regression model of the form

y(x; W) = W φ(x)    (7)

where the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D × M matrix. Generalized linear regression models possess the same universal approximation capabilities as multi-layer adaptive networks, provided the basis functions φ_j(x) are chosen appropriately. The usual limitation of such models, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space (Bishop 1995). In the present context this is not a significant problem since the dimensionality is governed by the number of latent variables which will typically be small. In fact for data visualization applications we generally use L = 2.

The maximization of (6) can be regarded as a missing-data problem in which the identity i of the component which generated each data point t_n is unknown. We can formulate the EM algorithm for this model as follows. First, suppose that, at some point in the algorithm, the current weight matrix is given by W_old and the current inverse noise variance is given by β_old. In the E-step we use W_old and β_old to evaluate the posterior probabilities, or responsibilities, of each Gaussian component i for every data point t_n using Bayes' theorem in the form

R_in(W_old, β_old) = p(x_i | t_n; W_old, β_old)    (8)
                   = p(t_n | x_i; W_old, β_old) / Σ_{i'=1}^{K} p(t_n | x_{i'}; W_old, β_old)    (9)
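The E-step (8)-(9) translates directly into code; since the prefactor (β/2π)^{D/2} is common to every component it cancels in the ratio. A minimal numpy sketch, in which the matrices Y and T are made-up stand-ins for the projected points y(x_i; W) and the data:

```python
import numpy as np

def responsibilities(Y, T, beta):
    """E-step (8)-(9): R[i, n] = p(x_i | t_n) for centres Y (K x D),
    data T (N x D) and inverse variance beta. The Gaussian prefactor
    (beta/2pi)^(D/2) is common to all components and cancels."""
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)  # K x N distances
    log_r = -0.5 * beta * sq
    log_r -= log_r.max(axis=0, keepdims=True)   # guard against underflow
    R = np.exp(log_r)
    return R / R.sum(axis=0, keepdims=True)     # normalize over components i

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 3))                     # K = 6 centres in D = 3
T = rng.normal(size=(10, 3))                    # N = 10 data points
R = responsibilities(Y, T, beta=2.0)
```

Working in log space and subtracting the per-point maximum keeps the exponentials well-conditioned even when the squared distances span a large range.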
We now consider the expectation of the complete-data log likelihood in the form

⟨L_comp(W, β)⟩ = Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) ln p(t_n | x_i; W, β)    (10)

Maximizing (10) with respect to W, and using (1) and (7), we obtain

Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) { W_new φ(x_i) − t_n } φᵀ(x_i) = 0    (11)

This can conveniently be written in matrix notation in the form

Φᵀ G_old Φ Wᵀ_new = Φᵀ R_old T    (12)

where Φ is a K × M matrix with elements Φ_ij = φ_j(x_i), T is a N × D matrix with elements t_nk, R is a K × N matrix with elements R_in, and G is a K × K diagonal matrix with elements

G_ii = Σ_{n=1}^{N} R_in(W, β)    (13)

We can now solve (12) for W_new using standard matrix inversion techniques, based on singular value decomposition to allow for possible ill-conditioning. Note that the matrix Φ is constant throughout the algorithm, and so need only be evaluated once at the start.

Similarly, maximizing (10) with respect to β we obtain the following re-estimation formula

1/β_new = (1/ND) Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) ‖W_new φ(x_i) − t_n‖²    (14)

The EM algorithm alternates between the E-step, corresponding to the evaluation of the posterior probabilities in (9), and the M-step, given by the solution of (12) and (14). Jensen's inequality can be used to show that, at each iteration of the algorithm, the objective function will increase unless it is already at a (local) maximum, as discussed for example in Bishop (1995). Typically the EM algorithm gives satisfactory convergence after a few tens of cycles, particularly since we are primarily interested in convergence of the distribution and this is often achieved much more rapidly than convergence of the parameters themselves.

If desired, a regularization term can be added to the objective function to control the mapping y(x; W). This can be interpreted as a MAP (maximum a-posteriori) estimator corresponding to a choice of prior over the weights W. In the case of a radially-symmetric Gaussian prior of the form

p(W | λ) = (λ/2π)^{MD/2} exp{ −(λ/2) ‖w‖² }    (15)

where λ is the regularization coefficient and ‖w‖² denotes the sum of the squares of the elements of W, this leads to a modification of the M-step (12) to give

(Φᵀ G_old Φ + (λ/β) I) Wᵀ_new = Φᵀ R_old T    (16)

where I is the M × M identity matrix.
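In code, the M-step amounts to one linear solve and one weighted average. A sketch under made-up shapes, using an SVD-based least-squares solve in view of the possible ill-conditioning noted above; the lam > 0 branch adds the weight-decay term corresponding to the Gaussian prior (15):

```python
import numpy as np

def m_step(Phi, R, T, beta, lam=0.0):
    """M-step: solve (12) (with an optional weight-decay term from the
    Gaussian prior (15) when lam > 0) for W_new, then re-estimate beta
    from (14). Phi: K x M basis activations, R: K x N responsibilities,
    T: N x D data."""
    M = Phi.shape[1]
    G = R.sum(axis=1)                                  # diagonal of G, eq (13)
    A = Phi.T @ (G[:, None] * Phi) + (lam / beta) * np.eye(M)
    B = Phi.T @ R @ T
    W_new_T = np.linalg.lstsq(A, B, rcond=None)[0]     # SVD-based solve of A W^T = B
    Y = Phi @ W_new_T                                  # new centres y(x_i; W_new)
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    beta_new = T.size / (R * sq).sum()                 # inverse of (14)
    return W_new_T.T, beta_new

rng = np.random.default_rng(2)
K, M, N, D = 9, 4, 25, 3
Phi = rng.normal(size=(K, M))
R = rng.random((K, N)); R /= R.sum(axis=0, keepdims=True)
T = rng.normal(size=(N, D))
W_new, beta_new = m_step(Phi, R, T, beta=1.0)
```

Note that beta is re-estimated using the newly updated W, matching the order in which (12) and (14) are applied in the text.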
Figure 3: Examples of manifolds generated by sampling from the prior distribution over W given by (15), showing the effect of the choice of basis functions on the smoothness of the manifold. Here the basis functions are Gaussian with width σ = 4s in the left-hand plot (where s is the spacing of the basis function centres), and σ = 2s in the right-hand plot. Different values of λ simply affect the linear scaling of the embedded manifold.

In the GTM algorithm, the prior distribution over mapping functions y(x; W) is governed by the prior over weights W, given for example by (15), as well as by the basis functions, which are themselves part of this prior. We typically choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter σ, whose value, along with the number and spacing of the basis functions, determines the smoothness of the manifold. Examples of surfaces generated by sampling the prior are shown in Figure 3.

In addition to the basis functions φ_j(x), it is also necessary to select the latent-space sample points {x_i}. Note that, if there are too few sample points in relation to the number of basis functions, then the Gaussian mixture centres in data space become relatively independent and the desired smoothness properties can be lost. Having a large number of sample points, however, causes no difficulty beyond increased computational cost. In particular, there is no `over-fitting' if the number of sample points is increased, since the number of degrees of freedom in the model is controlled by the mapping function y(x; W). One way to view the role of the latent space samples {x_i} is as a Monte Carlo approximation to the integral over x in (2) (MacKay 1995; Bishop, Svensén, and Williams 1996a). The choice of the number K and location of the sample points x_i in latent space is not critical, and we typically choose Gaussian basis functions and set K so that, in the case of a two-dimensional latent space, O(100) sample points lie within 2σ of the centre of each basis function.

Note that we have considered the basis function parameters (widths and locations) to be fixed, with a Gaussian prior on the weight matrix W. In principle, priors over the basis function parameters could also be introduced, and these could again be treated by MAP estimation or by Bayesian integration.

We initialize the parameters W so that the GTM model initially approximates principal component analysis. To do this, we first evaluate the data covariance matrix and obtain the first and second principal eigenvectors, and then we determine W by minimizing the error function

E = (1/2) Σ_i ‖W φ(x_i) − U x_i‖²    (20)

where the columns of U are given by the eigenvectors. This represents the sum-of-squares error between the projections of the latent points into data space by the GTM model and the corresponding projections obtained from PCA. The value of β⁻¹ is initialized to be the larger of either the (L+1)th eigenvalue from PCA (representing the variance of the data away from the PCA plane) or the square of half of the grid spacing of the PCA-projected latent points in data space.

Finally, we note that in a numerical implementation care must be taken over the evaluation of the responsibilities since this involves computing the exponentials of the distances between the projected latent points and the data points, which may span a significant range of values.

2.4 Summary of the GTM Algorithm

Although the foregoing discussion has been somewhat detailed, the underlying GTM algorithm itself is straightforward and is summarized here for convenience. GTM consists of a constrained mixture of Gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points {x_i} in latent space, together with a set of basis functions {φ_j(x)}. The adaptive parameters W and β define a constrained mixture of Gaussians with centres Wφ(x_i) and a common covariance matrix given by β⁻¹I. After initializing W and β, training involves alternating between the E-step in which the posterior probabilities are evaluated using (9), and the M-step in which W and β are re-estimated using (12) and (14) respectively. Evaluation of the log likelihood using (6) at the end of each cycle can be used to monitor convergence.
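The summary above translates almost line-for-line into numpy. The following self-contained sketch runs GTM on a made-up noisy curve in two dimensions; for brevity W is initialized with small random values rather than from the PCA procedure described earlier, and all grid sizes and widths are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: noisy 1-d curve embedded in D = 2 dimensions
N = 200
s = rng.uniform(-1.0, 1.0, size=N)
T = np.stack([s, s ** 2], axis=1) + 0.05 * rng.normal(size=(N, 2))

# latent sample points x_i and Gaussian basis functions phi_j on uniform grids
K, M, sigma = 25, 6, 0.4
X = np.linspace(-1.0, 1.0, K)[:, None]
mu = np.linspace(-1.0, 1.0, M)[None, :]
Phi = np.exp(-((X - mu) ** 2) / (2 * sigma ** 2))
Phi = np.hstack([Phi, np.ones((K, 1))])            # add a bias basis function

W_T = 0.1 * rng.normal(size=(M + 1, 2))            # stores W^T; centres = Phi @ W^T
beta = 1.0
log_liks = []
for _ in range(30):
    Y = Phi @ W_T                                  # centres y(x_i; W)
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(-1)    # K x N sq. distances
    log_p = np.log(beta / (2 * np.pi)) - 0.5 * beta * sq   # ln p(t_n | x_i), D = 2
    m = log_p.max(axis=0)
    log_liks.append((m + np.log(np.exp(log_p - m).mean(axis=0))).sum())  # (6)
    R = np.exp(log_p - m)                          # E-step: responsibilities (9)
    R /= R.sum(axis=0, keepdims=True)
    G = R.sum(axis=1)                              # diagonal of G, eq (13)
    W_T = np.linalg.lstsq(Phi.T @ (G[:, None] * Phi),
                          Phi.T @ R @ T, rcond=None)[0]    # M-step: solve (12)
    sq = (((Phi @ W_T)[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    beta = T.size / (R * sq).sum()                 # M-step: update (14)
```

The log likelihood (6) is evaluated once per cycle and, as guaranteed for EM, should not decrease, which makes it a convenient convergence monitor.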
3 Experimental Results

We now present results from the application of this algorithm first to a toy problem involving data in two dimensions, and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines. In both examples we choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter chosen equal to twice the separation of neighbouring basis function centres. Results from a toy problem for the case of a 2-dimensional data space and a 1-dimensional latent space are shown in Figure 4.
3.1 Oil Flow Data

Our second example arises from the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas (Bishop and James 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data is used which models accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to laminar, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the 3 configurations. We take the latent-variable space to be two-dimensional, since our goal is data visualization.

Figure 5 shows the oil data visualized in the latent-variable space in which, for each data point, we have plotted the posterior mean vector. Each point has then been labelled according to its
Figure 4: Results from a toy problem involving data generated from a 1-dimensional curve embedded in 2 dimensions, together with the projected latent points (`+') and their Gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and the converged configuration, obtained after 15 iterations of EM, is shown on the right.
Figure 5: The left plot shows the posterior-mean projection of the oil flow data in the latent space of the GTM model, while the plot on the right shows the same data set visualized using principal component analysis. In both plots, crosses, circles and plus-signs represent stratified, annular and homogeneous multi-phase configurations respectively. Note how the non-linearity of GTM gives an improved separation of the clusters.

multi-phase configuration. For comparison, Figure 5 also shows the corresponding results obtained using principal component analysis.

4 Relation to the Self-Organizing Map

Since one motivation for GTM is to provide a principled alternative to the self-organizing map, it is useful to consider the precise relationship between GTM and SOM. We focus our attention on the batch versions of both algorithms as this helps to make the relationship particularly clear.

The batch version of the SOM algorithm (Kohonen 1995) can be described as follows. A set of K reference vectors z_i is defined in the data space, in which each vector is associated with a node on a regular lattice in a (typically) two-dimensional `feature map' (analogous to the latent space of GTM). The algorithm begins by initializing the reference vectors (for example by setting them to random values, by setting them equal to a random subset of the data points, or by using principal component analysis). Each cycle of the algorithm then proceeds as follows. For every data vector t_n the corresponding `winning node' j(n) is identified, corresponding to the reference vector z_j having the smallest Euclidean distance ‖z_j − t_n‖² to t_n. The reference vectors are then updated by setting them equal to weighted averages of the data points given by

z_i = Σ_n h_{i j(n)} t_n / Σ_n h_{i j(n)}    (21)

in which h_{ij} is a neighbourhood function associated with the ith node. This is generally chosen to be a uni-modal function of the feature map coordinates centred on the winning node, for example a Gaussian. The steps of identifying the winning nodes and updating the reference vectors are repeated iteratively. A key ingredient in the algorithm is that the width of the neighbourhood function h_{ij} starts with a relatively large value and is gradually reduced after each iteration.
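One cycle of the batch SOM just described can be sketched as follows; the lattice, data and neighbourhood width are made-up, and h_ij is the Gaussian choice mentioned above:

```python
import numpy as np

def batch_som_cycle(Z, nodes, T, width):
    """One batch SOM cycle: Z (K x D) reference vectors, nodes (K x 2)
    feature-map lattice coordinates, T (N x D) data, Gaussian
    neighbourhood of the given width. Returns the updated Z via (21)."""
    d2 = ((T[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # N x K distances
    win = d2.argmin(axis=1)                               # winning node j(n)
    # h[i, n] = exp(-||node_i - node_j(n)||^2 / (2 width^2))
    h = np.exp(-((nodes[:, None, :] - nodes[win][None, :, :]) ** 2).sum(-1)
               / (2 * width ** 2))
    return (h @ T) / h.sum(axis=1, keepdims=True)         # weighted means (21)

rng = np.random.default_rng(5)
side = 5                                                  # 5 x 5 feature map
nodes = np.stack(np.meshgrid(np.arange(side), np.arange(side)),
                 axis=-1).reshape(-1, 2).astype(float)
T = rng.normal(size=(100, 3))
Z = T[rng.choice(100, size=side * side, replace=False)]   # init from the data
Z_new = batch_som_cycle(Z, nodes, T, width=2.0)
```

In a full run, the cycle would be repeated with `width` gradually reduced, mirroring the hand-crafted neighbourhood shrinking described in the text.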
4.1 Kernel versus Linear Regression
As pointed out by Mulier and Cherkassky (1995), the value of the neighbourhood function h_{i j(n)} depends only on the identity of the winning node j and not on the value of the corresponding data vector t_n. We can therefore perform partial sums over the groups G_j of data vectors assigned to each node j, and hence re-write (21) in the form

z_i = Σ_j K_ij m_j    (22)

in which m_j is the mean of the vectors in group G_j and is given by

m_j = (1/N_j) Σ_{n∈G_j} t_n    (23)

where N_j is the number of data vectors in group G_j. The result (22) is analogous to the Nadaraya-Watson kernel regression formula (Nadaraya 1964; Watson 1964) with the kernel functions given by

K_ij = h_ij N_j / Σ_{j'} h_{ij'} N_{j'}    (24)

Thus the batch SOM algorithm replaces the reference vectors at each cycle with a convex combination of the node means m_j, with coefficients determined by the neighbourhood function. Note that the kernel coefficients satisfy Σ_j K_ij = 1 for every i.
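The equivalence of the two forms is easy to verify numerically. In this made-up example (random data, random winning-node assignments, a 1-d feature map), smoothing the group means m_j with the kernel K_ij reproduces the direct batch update (21):

```python
import numpy as np

# Numerical check of the rewriting (22)-(24): smoothing the group means
# m_j with the kernel K_ij reproduces the direct batch update (21).
# The lattice, neighbourhood and assignments are all made-up.

rng = np.random.default_rng(3)
n_nodes, N, D = 6, 40, 2
T = rng.normal(size=(N, D))
win = rng.integers(n_nodes, size=N)                   # winning node j(n)
coords = np.arange(n_nodes, dtype=float)              # 1-d feature map
h = np.exp(-0.5 * (coords[:, None] - coords[None, :]) ** 2)   # h_ij

# direct form (21)
z_direct = (h[:, win] @ T) / h[:, win].sum(axis=1, keepdims=True)

# kernel form: group sizes N_j, group means m_j (23), kernel K_ij (24)
Nj = np.bincount(win, minlength=n_nodes)
m = np.stack([T[win == j].mean(axis=0) if Nj[j] else np.zeros(D)
              for j in range(n_nodes)])
Kij = h * Nj / (h * Nj).sum(axis=1, keepdims=True)
z_kernel = Kij @ m                                    # (22)
```

The rows of K_ij sum to one, confirming that each updated reference vector is a convex combination of the node means.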
In the GTM algorithm, the centres y(x_i; W) of the Gaussian components can be regarded as analogous to the reference vectors z_i of the SOM. We can evaluate y(x_i; W) by solving the M-step equation (12) to find W and then using y(x_i; W) = Wφ(x_i). If we define the weighted means of the data vectors by

μ_i = Σ_n R_in t_n / Σ_n R_in    (25)

then we obtain

y(x_i; W) = Σ_j F_ij μ_j    (26)

where we have introduced the effective kernel F_ij given by

F_ij = φᵀ(x_i) (Φᵀ G Φ)⁻¹ φ(x_j) G_jj    (27)

Note that the effective kernel satisfies Σ_j F_ij = 1. To see this, we first use (27) to show that Σ_j F_ij φ_l(x_j) = φ_l(x_i). Then if one of the basis functions φ_l corresponds to a bias, so that φ_l(x) = const., the result follows.

The solution for y(x_i; W) given by (26) and (27) can be interpreted as a weighted least-squares regression (Mardia, Kent, and Bibby 1979) in which the `target' vectors are the μ_i, and the weighting coefficients are given by G_jj.

Figure 6: Example of the effective kernel F_ij plotted as a function of the node j for a given node i, for the oil flow data set after 3 iterations of EM. This kernel function is analogous to the (normalized) neighbourhood function in the SOM algorithm.

Figure 6 shows an example of the effective kernel for GTM corresponding to the oil flow problem discussed in Section 3.

From (22) and (26) we see that both GTM and SOM can be regarded as forms of kernel smoothers. However, there are two key differences. The first is that in SOM the vectors which are smoothed, defined by (23), correspond to hard assignments of data points to nodes, whereas the corresponding vectors in GTM, given by (25), involve soft assignments, weighted by the posterior probabilities. This is analogous to the distinction between K-means clustering (hard assignments) and fitting a standard Gaussian mixture model using EM (soft assignments).

The second key difference is that the kernel function in SOM is made to shrink during the course of the algorithm in an arbitrary, hand-crafted manner. In GTM the posterior probability distribution in latent space, for a given data point, forms a localised `bubble' and the radius of this bubble shrinks automatically during training, as shown in Figure 7. This responsibility bubble governs the extent to which individual data points contribute towards the vectors μ_i in (25) and hence towards the updating of the Gaussian centres y(x_i; W) via (26).
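The effective kernel (27) and the smoothing (26) can also be checked numerically. In this made-up example (random responsibilities, Gaussian basis functions plus a bias), the rows of F sum to one as claimed, and applying F to the weighted means (25) reproduces the M-step centres obtained from (12):

```python
import numpy as np

# Sketch of the effective kernel (27). With a bias basis function included,
# each row of F sums to one, as shown in the text. Phi, R and T are made-up.

rng = np.random.default_rng(4)
K, M, N, D = 12, 4, 30, 3
X = np.linspace(-1, 1, K)[:, None]
mu = np.linspace(-1, 1, M)[None, :]
Phi = np.exp(-((X - mu) ** 2) / (2 * 0.3 ** 2))
Phi = np.hstack([Phi, np.ones((K, 1))])            # bias: phi_l(x) = const.

R = rng.random((K, N))
R /= R.sum(axis=0, keepdims=True)                  # responsibilities
G = np.diag(R.sum(axis=1))                         # eq (13)

# F_ij = phi(x_i)^T (Phi^T G Phi)^(-1) phi(x_j) G_jj          (27)
F = Phi @ np.linalg.solve(Phi.T @ G @ Phi, Phi.T) @ G

# y(x_i; W) from the M-step is the kernel smoothing (26) of the
# weighted means mu_i defined in (25):
T = rng.normal(size=(N, D))
mu_vecs = (R @ T) / R.sum(axis=1, keepdims=True)   # (25)
Y = F @ mu_vecs                                    # (26)
```

This makes the kernel-smoother view of GTM concrete: Y agrees with Phi @ W_new^T where W_new^T solves the normal equations (12).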
Figure 7: Examples of the posterior probabilities (responsibilities) R_in of the latent space points at an early stage (left), intermediate stage (centre) and late stage (right) during the convergence of the GTM algorithm. These have been evaluated for a single data point from the training set in the oil-flow problem discussed in Section 3, and are plotted using a non-linear scaling of the form p(x|t_n)^{0.1} to highlight the variation over the latent space. Notice how the responsibility `bubble', which governs the updating of the weight matrix, and hence the updating of the data-space vectors y(x_i; W), shrinks automatically during the learning process.

4.2 Comparison of GTM with SOM
The most significant difference between the GTM and SOM algorithms is that GTM defines an explicit probability density given by the mixture distribution in (5). As a consequence there is a well-defined objective function given by the log likelihood (6), and convergence to a (local) maximum of the objective function is guaranteed by the use of the EM algorithm (Dempster, Laird, and Rubin 1977). This also provides a direct means to compare different choices of model parameters, and even to compare a GTM solution with another density model, by evaluating the likelihood of a test set under the generative distributions of the respective models. For the SOM algorithm, however, there is no probability density and no well-defined objective function which is being minimized by the training process. Indeed it has been proven (Erwin, Obermayer, and Schulten 1992) that such an objective function cannot exist for the SOM.

A further limitation of the SOM, highlighted in Kohonen (1995, page 234), is that the conditions under which so-called `self-organization' of the SOM occurs have not been quantified, and so in practice it is necessary to confirm empirically that the trained model does indeed have the desired spatial ordering. In contrast, the neighbourhood-preserving nature of the GTM mapping is an automatic consequence of the choice of a continuous function y(x; W).

Similarly, the smoothness properties of the SOM are determined indirectly by the choice of neighbourhood function and by the way in which it is changed during the course of the algorithm, and are therefore difficult to control. Thus, prior knowledge about the form of the map cannot easily be specified. The prior distribution for GTM, however, can be controlled directly, and properties such as smoothness are governed explicitly by basis function parameters, as illustrated in Figure 3.
Finally, we consider the relative computational costs of the GTM and SOM algorithms. For problems involving data in high-dimensional spaces the dominant computational cost of GTM arises from the evaluation of the Euclidean distances from every data point to every Gaussian centre y(x_i; W). Since exactly the same calculations must be done for SOM (involving the distances of data points from the reference vectors z_i) we expect one iteration of either algorithm to take approximately the same time. An empirical comparison of the computational cost of GTM and SOM was obtained by running each algorithm on the oil flow data until `convergence' (defined as no discernible change in the appearance of the visualization map). The GTM algorithm took 1058 sec. (40 iterations) while the batch SOM took 1011 sec. (25 iterations) using a Gaussian neighbourhood function. With a simple `top-hat' neighbourhood function, in which each reference vector is updated at each iteration using only data points associated with nearby reference vectors, the CPU time for the SOM algorithm is reduced to 305 sec. (25 iterations). One potential advantage of GTM in practical applications arises from a reduction in the number of experimental training runs needed since both convergence and topographic ordering are guaranteed.
5 Relation to Other Algorithms

There are several algorithms in the published literature which have close links with GTM. Here we review briefly the most significant of these.
The elastic net algorithm of Durbin and Willshaw (1987) can be viewed as a Gaussian mixture density model, fitted by penalized maximum likelihood. The penalty term encourages the centres of Gaussians corresponding to neighbouring points along the (typically one-dimensional) chain to be close in data space. It differs from GTM in that it does not define a continuous data space manifold. Also, the training algorithm generally involves a hand-crafted annealing of the weight penalty coefficient.

There are also similarities between GTM and principal curves and principal surfaces (Hastie and Stuetzle 1989; LeBlanc and Tibshirani 1994) which again involve a two-stage algorithm consisting of projection followed by smoothing, although these are not generative models. It is interesting to note that Hastie and Stuetzle (1989) propose reducing the spatial width of the smoothing function during learning, in a manner analogous to the shrinking of the neighbourhood function in the SOM. A modified form of the principal curves algorithm (Tibshirani 1992) introduces a generative distribution based on a mixture of Gaussians, with a well-defined likelihood function, and is trained by the EM algorithm. However, the number of Gaussian components is equal to the number of data points, and smoothing is imposed by penalizing the likelihood function with the addition of a derivative-based regularization term.

The technique of parametrized self-organizing maps (PSOMs) involves first fitting a standard SOM model to a data set and then finding a manifold in data space which interpolates the reference vectors (Ritter 1993). Although this defines a continuous manifold, the interpolating surface does not form part of the training algorithm, and the basic problems in using SOM, discussed in Section 4.2, remain.

The SOM has also been used for vector quantization. In this context it has been shown how a re-formulation of the vector quantization problem (Luttrell 1990; Buhmann and Kuhnel 1993; Luttrell 1994) can avoid many of the problems with the SOM procedure discussed earlier.

Finally, the `density network' model of MacKay (1995) involves transforming a simple distribution in latent space to a complex distribution in data space by propagation through a non-linear network. A discrete distribution in latent space is again used, which is interpreted as an approximate Monte Carlo integration over the latent variables needed to define the data space distribution. GTM can be seen as a particular instance of this framework in which the sampling of latent space is regular rather than stochastic, a specific form of non-linearity is used, and the model parameters are adapted using EM.
6 Discussion
In this paper we have introduced a form of non-linear latent variable model which can be trained efficiently using the EM algorithm. Viewed as a topographic mapping algorithm, it has the key property that it defines a probability density model.

As an example of the significance of having a probability density, consider the important practical problem of dealing with missing values in the data set (in which some components of the data vectors t_n are unobserved). If the missing values are missing at random (Little and Rubin 1987) then the likelihood function is obtained by integrating out the unobserved values. For the GTM model the integrations can be performed analytically, leading to a simple modification of the EM algorithm.

A further consequence of having a probabilistic approach is that it is straightforward to consider a mixture of GTM models. In this case the overall density can be written as

    p(t) = Σ_r P(r) p(t|r)                                              (28)

where p(t|r) represents the rth model, with its own set of independent parameters, and P(r) are mixing coefficients satisfying 0 ≤ P(r) ≤ 1 and Σ_r P(r) = 1. Again, it is straightforward to extend the EM algorithm to maximize the corresponding likelihood function.

The GTM algorithm can be extended in other ways, for instance by allowing independent mixing coefficients π_i (prior probabilities) for each of the Gaussian components, which again can be estimated by a straightforward extension of the EM algorithm. Instead of being independent parameters, the π_i can be determined as smooth functions of the latent variables using a normalized exponential applied to a generalized linear regression model, although in this case the M-step of the EM algorithm would involve non-linear optimization. Similarly, the inverse noise variance β can be generalized to a function of x. An important property of GTM is the existence of a smooth manifold in data space, which allows the local 'magnification factor' between latent and data space to be evaluated as a function of the latent space coordinates using the techniques of differential geometry (Bishop, Svensén, and Williams 1996b). Finally, since there is a well-defined likelihood function, it is straightforward in principle to introduce priors over the model parameters (as discussed in Section 2.1) and to use Bayesian techniques in place of maximum likelihood.

Throughout this paper we have focussed on the batch version of the GTM algorithm in which all of the training data are used together to update the model parameters. In some applications it will be more convenient to consider sequential adaptation in which data points are presented one at a time. Since we are minimizing a differentiable cost function, given by (6), a sequential algorithm can be obtained by appealing to the Robbins-Monro procedure (Robbins and Monro 1951; Bishop 1995) to find a zero of the objective function gradient. Alternatively, a sequential form of the EM algorithm can be used (Titterington, Smith, and Makov 1985).

A web site for GTM is provided at:

    http://www.ncrg.aston.ac.uk/GTM/

which includes postscript files of relevant papers, a software implementation in Matlab (a C implementation is under development), and example data sets used in the development of the GTM algorithm.
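The Robbins-Monro procedure cited above can be illustrated with a deliberately simple sketch (our own, not the paper's): it finds a zero of an expected gradient from one observation at a time, with step sizes a_n = a_0/n satisfying the usual conditions Σ a_n = ∞ and Σ a_n² < ∞. Here the gradient is that of a squared-error cost, so the iterates sequentially estimate a mean; a sequential GTM would apply the same schedule to the gradient of its log-likelihood instead:

```python
def robbins_monro(samples, a0=1.0):
    """Sequential stochastic root-finding: theta <- theta + a_n * g(theta, t_n),
    with a_n = a0 / n and g(theta, t) = t - theta (the negative gradient of
    the one-sample cost 0.5 * (t - theta)**2). With a0 = 1 the iterate after
    n steps is exactly the running average of the first n samples."""
    theta = 0.0
    for n, t in enumerate(samples, start=1):
        theta += (a0 / n) * (t - theta)   # one observation per update
    return theta
```

The attraction in this setting is that no pass over the full data set is needed: each data point is used once, as it arrives, and then discarded.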
Acknowledgements
This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualization of High-Dimensional Data. We would like to thank Geoffrey Hinton, Iain Strachan and Michael Tipping for useful discussions. Markus Svensén would like to thank the staff of the SANS group in Stockholm for their hospitality during part of this project.
References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A 327, 580–593.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996a). A fast EM algorithm for latent variable density models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 465–471. MIT Press.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996b). Magnification factors for the GTM algorithm. To appear in Proceedings Fifth IEE International Conference on Artificial Neural Networks.

Buhmann, J. and K. Kühnel (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory 39(4), 1133–1145.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), 1–38.

Durbin, R. and D. Willshaw (1987). An analogue approach to the travelling salesman problem. Nature 326, 689–691.

Erwin, E., K. Obermayer, and K. Schulten (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics 67, 47–55.

Hastie, T. and W. Stuetzle (1989). Principal curves. Journal of the American Statistical Association 84(406), 502–516.

Hinton, G. E., C. K. I. Williams, and M. D. Revow (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems, Volume 4, pp. 512–519. Morgan Kaufmann.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59–69.

Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.

LeBlanc, M. and R. Tibshirani (1994). Adaptive principal surfaces. Journal of the American Statistical Association 89(425), 53–64.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Transactions on Neural Networks 1(2), 229–232.

Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation 6(5), 767–794.

MacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research A 354(1), 73–80.

Mardia, K., J. Kent, and M. Bibby (1979). Multivariate Analysis. Academic Press.

Mulier, F. and V. Cherkassky (1995). Self-organization as an iterative kernel smoothing process. Neural Computation 7(6), 1165–1177.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications 9(1), 141–142.

Ritter, H. (1993). Parametrized self-organizing maps. In Proceedings ICANN'93 International Conference on Artificial Neural Networks, Amsterdam, pp. 568–575. Springer-Verlag.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.

Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing 2, 183–190.

Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. New York: John Wiley.

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26, 359–372.