GTM: The Generative Topographic Mapping

Christopher M. Bishop, Markus Svensén, Christopher K. I. Williams
Neural Computing Research Group, Dept of Computer Science & Applied Mathematics
Aston University, Birmingham B4 7ET, United Kingdom
http://www.ncrg.aston.ac.uk/   Tel: +44 (0)121 333 4631   Fax: +44 (0)121 333 4586
C.M.Bishop@aston.ac.uk, svensjfm@aston.ac.uk, C.K.I.Williams@aston.ac.uk

Technical Report NCRG/96/015. Accepted for publication in Neural Computation. April 16, 1997

Abstract

Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this paper we introduce a form of non-linear latent variable model called the Generative Topographic Mapping for which the parameters of the model can be determined using the EM algorithm. GTM provides a principled alternative to the widely used Self-Organizing Map (SOM) of Kohonen (1982), and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multi-phase oil pipeline.
1 Introduction
Many data sets exhibit significant correlations between the variables. One way to capture such structure is to model the distribution of the data in terms of latent, or hidden, variables. A familiar example of this approach is factor analysis, which is based on a linear transformation from latent space to data space. In this paper we show how the latent variable framework can be extended to allow non-linear transformations while remaining computationally tractable. This leads to the GTM (Generative Topographic Mapping) algorithm, which is based on a constrained mixture of Gaussians whose parameters can be optimized using the EM (expectation-maximization) algorithm.

One of the motivations for this work is to provide a principled alternative to the widely used `self-organizing map' (SOM) algorithm (Kohonen 1982) in which a set of unlabelled data vectors t_n (n = 1, …, N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. While this algorithm has achieved many successes in practical applications, it also suffers from some significant deficiencies, many of which are highlighted in Kohonen (1995). They include: the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighbourhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm¹. We show that the GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages.

An important application of latent variable models is to data visualization. Many of the models used in visualization are regarded as defining a projection from the D-dimensional data space onto a two-dimensional visualization space. We shall see that, by contrast, the GTM model is defined in terms of a mapping from the latent space into the data space. For the purposes of data visualization, the mapping is then inverted using Bayes' theorem, giving rise to a posterior distribution in latent space.

¹ The goal here is not neuro-biological modelling, but rather the development of effective algorithms for data analysis, for which biological realism need not be considered.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, …, t_D) in terms of a number L of latent variables x = (x_1, …, x_L). This is achieved by first considering a function y(x; W) which maps points x in the latent space into corresponding points y(x; W) in the data space. The mapping is governed by a matrix of parameters W, and could consist, for example, of a feed-forward neural network in which case W would represent the weights and biases. We are interested in the situation in which the dimensionality L of the latent-variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the latent-variable space into an L-dimensional non-Euclidean manifold S embedded within the data space. This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.

If we define a probability distribution p(x) on the latent-variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly.
Figure 1: The non-linear function y(x; W) defines a manifold S embedded in data space given by the image of the latent-variable space under the mapping x → y.

Since L < D, the distribution in t-space would be confined to the L-dimensional manifold and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We choose the distribution of t, for given x and W, to be a radially-symmetric Gaussian centred on y(x; W) having variance β⁻¹ so that
p(t | x; W, β) = (β/2π)^{D/2} exp{ −(β/2) ‖y(x; W) − t‖² }    (1)

Note that other models for p(t|x) might also be appropriate, such as a Bernoulli for binary variables (with a sigmoid transformation of y) or a multinomial for mutually exclusive classes (with a `softmax', or normalized exponential, transformation of y (Bishop 1995)), or even combinations of these. The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

p(t | W, β) = ∫ p(t | x; W, β) p(x) dx    (2)

For a given data set D = (t_1, …, t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood. In practice it is convenient to maximize the log likelihood, given by

L(W, β) = ln ∏_{n=1}^{N} p(t_n | W, β)    (3)

Once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), we can in principle determine W and β by maximizing L(W, β). However, the integral over x in (2) will, in general, be analytically intractable. If we choose y(x; W) to be a linear function of W, and we choose p(x) to be Gaussian, then the integral becomes a convolution of two Gaussians which is itself a Gaussian. For a noise distribution p(t|x) which is Gaussian with a diagonal covariance matrix, we obtain the standard factor analysis model. In the case of the radially symmetric Gaussian given by (1) the model is closely related to principal component analysis since the maximum likelihood solution for W has columns given by the scaled principal eigenvectors. Here we wish to extend this formalism to non-linear functions y(x; W), and in particular to develop a model which is similar in spirit to the SOM algorithm. We therefore consider a specific form for p(x) given by a sum of delta functions centred on the nodes of a regular grid in latent space

p(x) = (1/K) Σ_{i=1}^{K} δ(x − x_i)    (4)
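The generative model defined by (1), (2) and (4) can be sketched numerically. The following is a minimal illustration, not taken from the paper: the latent grid, the smooth mapping y (a hand-picked function standing in for y(x; W)), the inverse variance and all sizes are made-up choices; the density is simply the resulting average of K Gaussians.

```python
import numpy as np

# Minimal sketch of the generative model in (1), (2) and (4): a uniform grid
# of K delta functions in a 1-d latent space, a hand-picked smooth mapping
# y(x) into D = 2 dimensions (standing in for y(x; W)), and radially
# symmetric Gaussian noise with inverse variance beta.

rng = np.random.default_rng(0)
K, D, beta = 20, 2, 100.0
latent_grid = np.linspace(-1.0, 1.0, K)            # grid nodes x_i

def y(x):
    """Smooth map from latent space into data space (illustrative choice)."""
    return np.stack([x, np.sin(np.pi * x)], axis=-1)

def sample(n):
    """Draw n points t ~ p(t): pick a node uniformly, then add noise."""
    i = rng.integers(K, size=n)
    return y(latent_grid[i]) + rng.normal(scale=beta ** -0.5, size=(n, D))

def density(t):
    """p(t | W, beta) from (2) with the prior (4): an average of K Gaussians."""
    centres = y(latent_grid)                       # (K, D)
    sq = ((t[None, :] - centres) ** 2).sum(axis=1)
    return ((beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq)).mean()

data = sample(500)
p_on = density(y(np.array(0.0)))                   # a point on the manifold
p_off = density(np.array([0.0, 3.0]))              # a point far from it
```

Points on (or near) the image of the latent grid receive much higher density than points away from the manifold, which is precisely the singularity-plus-noise picture described above.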
Figure 2: In order to formulate a latent variable model which is similar in spirit to the SOM, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_i is mapped to a corresponding point y(x_i; W) in data space, and forms the centre of a corresponding Gaussian distribution.

in which case the integral in (2) can again be performed analytically. Each point x_i is then mapped to a corresponding point y(x_i; W) in data space, which forms the centre of a Gaussian density function, as illustrated in Figure 2. From (2) and (4) we see that the distribution function in data space then takes the form

p(t | W, β) = (1/K) Σ_{i=1}^{K} p(t | x_i; W, β)    (5)

and the log likelihood function becomes

L(W, β) = Σ_{n=1}^{N} ln{ (1/K) Σ_{i=1}^{K} p(t_n | x_i; W, β) }    (6)

For the particular noise model p(t | x; W, β) given by (1), the distribution p(t | W, β) corresponds to a constrained Gaussian mixture model (Hinton, Williams, and Revow 1992) since the centres of the Gaussians, given by y(x_i; W), cannot move independently but are related through the function y(x; W). Note that, provided the mapping function y(x; W) is smooth and continuous, the projected points y(x_i; W) will necessarily have a topographic ordering in the sense that any two points x_A and x_B which are close in latent space will map to points y(x_A; W) and y(x_B; W) which are close in data space.

2.1 The EM Algorithm

If we now choose a particular parametrized form for y(x; W) which is a differentiable function of W (for example, a feed-forward network with sigmoidal hidden units) then we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W, and an inverse variance β, which maximize L(W, β). However, our model consists of a mixture distribution which suggests that we might seek an EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin 1977; Bishop 1995). By making a suitable choice of model y(x; W) we will see that the M-step corresponds to the solution of a set of linear equations. In particular we shall choose y(x; W) to be given by a generalized linear regression model of the form

y(x; W) = W φ(x)    (7)

where the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D × M matrix. Generalized linear regression models possess the same universal approximation capabilities as multi-layer adaptive networks, provided the basis functions φ_j(x) are chosen appropriately. The usual limitation of such models, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space (Bishop 1995). In the present context this is not a significant problem since the dimensionality is governed by the number of latent variables which will typically be small. In fact for data visualization applications we generally use L = 2.

The maximization of (6) can be regarded as a missing-data problem in which the identity i of the component which generated each data point t_n is unknown. We can formulate the EM algorithm for this model as follows. First, suppose that, at some point in the algorithm, the current weight matrix is given by W_old and the current inverse noise variance is given by β_old. In the E-step we use W_old and β_old to evaluate the posterior probabilities, or responsibilities, of each Gaussian component i for every data point t_n using Bayes' theorem in the form

R_in(W_old, β_old) = p(x_i | t_n; W_old, β_old)    (8)
                   = p(t_n | x_i; W_old, β_old) / Σ_{i'=1}^{K} p(t_n | x_{i'}; W_old, β_old)    (9)
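The E-step (8)-(9) translates directly into code; since the prefactor (β/2π)^{D/2} is common to every component it cancels in the ratio. A minimal numpy sketch, in which the matrices Y and T are made-up stand-ins for the projected points y(x_i; W) and the data:

```python
import numpy as np

def responsibilities(Y, T, beta):
    """E-step (8)-(9): R[i, n] = p(x_i | t_n) for centres Y (K x D),
    data T (N x D) and inverse variance beta. The Gaussian prefactor
    (beta/2pi)^(D/2) is common to all components and cancels."""
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)  # K x N distances
    log_r = -0.5 * beta * sq
    log_r -= log_r.max(axis=0, keepdims=True)   # guard against underflow
    R = np.exp(log_r)
    return R / R.sum(axis=0, keepdims=True)     # normalize over components i

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 3))                     # K = 6 centres in D = 3
T = rng.normal(size=(10, 3))                    # N = 10 data points
R = responsibilities(Y, T, beta=2.0)
```

Working in log space and subtracting the per-point maximum keeps the exponentials well-conditioned even when the squared distances span a large range.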
We now consider the expectation of the complete-data log likelihood in the form

⟨L_comp(W, β)⟩ = Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) ln p(t_n | x_i; W, β)    (10)

Maximizing (10) with respect to W, and using (1) and (7), we obtain

Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) { W_new φ(x_i) − t_n } φᵀ(x_i) = 0    (11)

This can conveniently be written in matrix notation in the form

Φᵀ G_old Φ Wᵀ_new = Φᵀ R_old T    (12)

where Φ is a K × M matrix with elements Φ_ij = φ_j(x_i), T is a N × D matrix with elements t_nk, R is a K × N matrix with elements R_in, and G is a K × K diagonal matrix with elements

G_ii = Σ_{n=1}^{N} R_in(W, β)    (13)

We can now solve (12) for W_new using standard matrix inversion techniques, based on singular value decomposition to allow for possible ill-conditioning. Note that the matrix Φ is constant throughout the algorithm, and so need only be evaluated once at the start.

Similarly, maximizing (10) with respect to β we obtain the following re-estimation formula

1/β_new = (1/ND) Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) ‖W_new φ(x_i) − t_n‖²    (14)

The EM algorithm alternates between the E-step, corresponding to the evaluation of the posterior probabilities in (9), and the M-step, given by the solution of (12) and (14). Jensen's inequality can be used to show that, at each iteration of the algorithm, the objective function will increase unless it is already at a (local) maximum, as discussed for example in Bishop (1995). Typically the EM algorithm gives satisfactory convergence after a few tens of cycles, particularly since we are primarily interested in convergence of the distribution and this is often achieved much more rapidly than convergence of the parameters themselves.

If desired, a regularization term can be added to the objective function to control the mapping y(x; W). This can be interpreted as a MAP (maximum a-posteriori) estimator corresponding to a choice of prior over the weights W. In the case of a radially-symmetric Gaussian prior of the form

p(W | λ) = (λ/2π)^{MD/2} exp{ −(λ/2) ‖w‖² }    (15)

where λ is the regularization coefficient and ‖w‖² denotes the sum of the squares of the elements of W, this leads to a modification of the M-step (12) to give

(Φᵀ G_old Φ + (λ/β) I) Wᵀ_new = Φᵀ R_old T    (16)

where I is the M × M identity matrix.
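In code, the M-step amounts to one linear solve and one weighted average. A sketch under made-up shapes, using an SVD-based least-squares solve in view of the possible ill-conditioning noted above; the lam > 0 branch adds the weight-decay term corresponding to the Gaussian prior (15):

```python
import numpy as np

def m_step(Phi, R, T, beta, lam=0.0):
    """M-step: solve (12) (with an optional weight-decay term from the
    Gaussian prior (15) when lam > 0) for W_new, then re-estimate beta
    from (14). Phi: K x M basis activations, R: K x N responsibilities,
    T: N x D data."""
    M = Phi.shape[1]
    G = R.sum(axis=1)                                  # diagonal of G, eq (13)
    A = Phi.T @ (G[:, None] * Phi) + (lam / beta) * np.eye(M)
    B = Phi.T @ R @ T
    W_new_T = np.linalg.lstsq(A, B, rcond=None)[0]     # SVD-based solve of A W^T = B
    Y = Phi @ W_new_T                                  # new centres y(x_i; W_new)
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    beta_new = T.size / (R * sq).sum()                 # inverse of (14)
    return W_new_T.T, beta_new

rng = np.random.default_rng(2)
K, M, N, D = 9, 4, 25, 3
Phi = rng.normal(size=(K, M))
R = rng.random((K, N)); R /= R.sum(axis=0, keepdims=True)
T = rng.normal(size=(N, D))
W_new, beta_new = m_step(Phi, R, T, beta=1.0)
```

Note that beta is re-estimated using the newly updated W, matching the order in which (12) and (14) are applied in the text.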
Figure 3: Examples of manifolds generated by sampling from the prior distribution over W given by (15), showing the effect of the choice of basis functions on the smoothness of the manifold. Here the basis functions are Gaussian with width σ = 4s in the left-hand plot (where s is the spacing of the basis function centres), and σ = 2s in the right-hand plot. Different values of λ simply affect the linear scaling of the embedded manifold.

In the GTM algorithm, the prior distribution over mapping functions y(x; W) is governed by the prior over weights W, given for example by (15), as well as by the basis functions, which are themselves part of this prior. We typically choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter σ, whose value, along with the number and spacing of the basis functions, determines the smoothness of the manifold. Examples of surfaces generated by sampling the prior are shown in Figure 3.

In addition to the basis functions φ_j(x), it is also necessary to select the latent-space sample points {x_i}. Note that, if there are too few sample points in relation to the number of basis functions, then the Gaussian mixture centres in data space become relatively independent and the desired smoothness properties can be lost. Having a large number of sample points, however, causes no difficulty beyond increased computational cost. In particular, there is no `over-fitting' if the number of sample points is increased, since the number of degrees of freedom in the model is controlled by the mapping function y(x; W). One way to view the role of the latent space samples {x_i} is as a Monte Carlo approximation to the integral over x in (2) (MacKay 1995; Bishop, Svensén, and Williams 1996a). The choice of the number K and location of the sample points x_i in latent space is not critical, and we typically choose Gaussian basis functions and set K so that, in the case of a two-dimensional latent space, O(100) sample points lie within 2σ of the centre of each basis function.

Note that we have considered the basis function parameters (widths and locations) to be fixed, with a Gaussian prior on the weight matrix W. In principle, priors over the basis function parameters could also be introduced, and these could again be treated by MAP estimation or by Bayesian integration.

We initialize the parameters W so that the GTM model initially approximates principal component analysis. To do this, we first evaluate the data covariance matrix and obtain the first and second principal eigenvectors, and then we determine W by minimizing the error function

E = (1/2) Σ_i ‖W φ(x_i) − U x_i‖²    (20)

where the columns of U are given by the eigenvectors. This represents the sum-of-squares error between the projections of the latent points into data space by the GTM model and the corresponding projections obtained from PCA. The value of β⁻¹ is initialized to be the larger of either the (L+1)th eigenvalue from PCA (representing the variance of the data away from the PCA plane) or the square of half of the grid spacing of the PCA-projected latent points in data space.

Finally, we note that in a numerical implementation care must be taken over the evaluation of the responsibilities since this involves computing the exponentials of the distances between the projected latent points and the data points, which may span a significant range of values.

2.4 Summary of the GTM Algorithm

Although the foregoing discussion has been somewhat detailed, the underlying GTM algorithm itself is straightforward and is summarized here for convenience. GTM consists of a constrained mixture of Gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points {x_i} in latent space, together with a set of basis functions {φ_j(x)}. The adaptive parameters W and β define a constrained mixture of Gaussians with centres Wφ(x_i) and a common covariance matrix given by β⁻¹I. After initializing W and β, training involves alternating between the E-step in which the posterior probabilities are evaluated using (9), and the M-step in which W and β are re-estimated using (12) and (14) respectively. Evaluation of the log likelihood using (6) at the end of each cycle can be used to monitor convergence.
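The summary above translates almost line-for-line into numpy. The following self-contained sketch runs GTM on a made-up noisy curve in two dimensions; for brevity W is initialized with small random values rather than from the PCA procedure described earlier, and all grid sizes and widths are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: noisy 1-d curve embedded in D = 2 dimensions
N = 200
s = rng.uniform(-1.0, 1.0, size=N)
T = np.stack([s, s ** 2], axis=1) + 0.05 * rng.normal(size=(N, 2))

# latent sample points x_i and Gaussian basis functions phi_j on uniform grids
K, M, sigma = 25, 6, 0.4
X = np.linspace(-1.0, 1.0, K)[:, None]
mu = np.linspace(-1.0, 1.0, M)[None, :]
Phi = np.exp(-((X - mu) ** 2) / (2 * sigma ** 2))
Phi = np.hstack([Phi, np.ones((K, 1))])            # add a bias basis function

W_T = 0.1 * rng.normal(size=(M + 1, 2))            # stores W^T; centres = Phi @ W^T
beta = 1.0
log_liks = []
for _ in range(30):
    Y = Phi @ W_T                                  # centres y(x_i; W)
    sq = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(-1)    # K x N sq. distances
    log_p = np.log(beta / (2 * np.pi)) - 0.5 * beta * sq   # ln p(t_n | x_i), D = 2
    m = log_p.max(axis=0)
    log_liks.append((m + np.log(np.exp(log_p - m).mean(axis=0))).sum())  # (6)
    R = np.exp(log_p - m)                          # E-step: responsibilities (9)
    R /= R.sum(axis=0, keepdims=True)
    G = R.sum(axis=1)                              # diagonal of G, eq (13)
    W_T = np.linalg.lstsq(Phi.T @ (G[:, None] * Phi),
                          Phi.T @ R @ T, rcond=None)[0]    # M-step: solve (12)
    sq = (((Phi @ W_T)[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    beta = T.size / (R * sq).sum()                 # M-step: update (14)
```

The log likelihood (6) is evaluated once per cycle and, as guaranteed for EM, should not decrease, which makes it a convenient convergence monitor.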
3 Experimental Results

We now present results from the application of this algorithm first to a toy problem involving data in two dimensions, and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines. In both examples we choose the basis functions φ_j(x) to be radially symmetric Gaussians whose centres are distributed on a uniform grid in x-space, with a common width parameter chosen equal to twice the separation of neighbouring basis function centres. Results from a toy problem for the case of a 2-dimensional data space and a 1-dimensional latent space are shown in Figure 4.
3.1 Oil Flow Data

Our second example arises from the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas (Bishop and James 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data is used which models accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to laminar, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the 3 configurations. We take the latent-variable space to be two-dimensional, since our goal is data visualization.

Figure 5 shows the oil data visualized in the latent-variable space in which, for each data point, we have plotted the posterior mean vector. Each point has then been labelled according to its
Figure 4: Results from a toy problem involving data generated from a 1-dimensional curve embedded in 2 dimensions, together with the projected latent points (`+') and their Gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and the converged configuration, obtained after 15 iterations of EM, is shown on the right.
Figure 5: The left plot shows the posterior-mean projection of the oil flow data in the latent space of the GTM model, while the plot on the right shows the same data set visualized using principal component analysis. In both plots, crosses, circles and plus-signs represent stratified, annular and homogeneous multi-phase configurations respectively. Note how the non-linearity of GTM gives an improved separation of the clusters.

multi-phase configuration. For comparison, Figure 5 also shows the corresponding results obtained using principal component analysis.

4 Relation to the Self-Organizing Map

Since one motivation for GTM is to provide a principled alternative to the self-organizing map, it is useful to consider the precise relationship between GTM and SOM. We focus our attention on the batch versions of both algorithms as this helps to make the relationship particularly clear.

The batch version of the SOM algorithm (Kohonen 1995) can be described as follows. A set of K reference vectors z_i is defined in the data space, in which each vector is associated with a node on a regular lattice in a (typically) two-dimensional `feature map' (analogous to the latent space of GTM). The algorithm begins by initializing the reference vectors (for example by setting them to random values, by setting them equal to a random subset of the data points, or by using principal component analysis). Each cycle of the algorithm then proceeds as follows. For every data vector t_n the corresponding `winning node' j(n) is identified, corresponding to the reference vector z_j having the smallest Euclidean distance ‖z_j − t_n‖² to t_n. The reference vectors are then updated by setting them equal to weighted averages of the data points given by

z_i = Σ_n h_{i j(n)} t_n / Σ_n h_{i j(n)}    (21)

in which h_{ij} is a neighbourhood function associated with the ith node. This is generally chosen to be a uni-modal function of the feature map coordinates centred on the winning node, for example a Gaussian. The steps of identifying the winning nodes and updating the reference vectors are repeated iteratively. A key ingredient in the algorithm is that the width of the neighbourhood function h_{ij} starts with a relatively large value and is gradually reduced after each iteration.
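One cycle of the batch SOM just described can be sketched as follows; the lattice, data and neighbourhood width are made-up, and h_ij is the Gaussian choice mentioned above:

```python
import numpy as np

def batch_som_cycle(Z, nodes, T, width):
    """One batch SOM cycle: Z (K x D) reference vectors, nodes (K x 2)
    feature-map lattice coordinates, T (N x D) data, Gaussian
    neighbourhood of the given width. Returns the updated Z via (21)."""
    d2 = ((T[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # N x K distances
    win = d2.argmin(axis=1)                               # winning node j(n)
    # h[i, n] = exp(-||node_i - node_j(n)||^2 / (2 width^2))
    h = np.exp(-((nodes[:, None, :] - nodes[win][None, :, :]) ** 2).sum(-1)
               / (2 * width ** 2))
    return (h @ T) / h.sum(axis=1, keepdims=True)         # weighted means (21)

rng = np.random.default_rng(5)
side = 5                                                  # 5 x 5 feature map
nodes = np.stack(np.meshgrid(np.arange(side), np.arange(side)),
                 axis=-1).reshape(-1, 2).astype(float)
T = rng.normal(size=(100, 3))
Z = T[rng.choice(100, size=side * side, replace=False)]   # init from the data
Z_new = batch_som_cycle(Z, nodes, T, width=2.0)
```

In a full run, the cycle would be repeated with `width` gradually reduced, mirroring the hand-crafted neighbourhood shrinking described in the text.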
4.1 Kernel versus Linear Regression
As pointed out by Mulier and Cherkassky (1995), the value of the neighbourhood function h_{i j(n)} depends only on the identity of the winning node j and not on the value of the corresponding data vector t_n. We can therefore perform partial sums over the groups G_j of data vectors assigned to each node j, and hence re-write (21) in the form

z_i = Σ_j K_ij m_j    (22)

in which m_j is the mean of the vectors in group G_j and is given by

m_j = (1/N_j) Σ_{n∈G_j} t_n    (23)

where N_j is the number of data vectors in group G_j. The result (22) is analogous to the Nadaraya-Watson kernel regression formula (Nadaraya 1964; Watson 1964) with the kernel functions given by

K_ij = h_ij N_j / Σ_{j'} h_{ij'} N_{j'}    (24)

Thus the batch SOM algorithm replaces the reference vectors at each cycle with a convex combination of the node means m_j, with coefficients determined by the neighbourhood function. Note that the kernel coefficients satisfy Σ_j K_ij = 1 for every i.
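The equivalence of the two forms is easy to verify numerically. In this made-up example (random data, random winning-node assignments, a 1-d feature map), smoothing the group means m_j with the kernel K_ij reproduces the direct batch update (21):

```python
import numpy as np

# Numerical check of the rewriting (22)-(24): smoothing the group means
# m_j with the kernel K_ij reproduces the direct batch update (21).
# The lattice, neighbourhood and assignments are all made-up.

rng = np.random.default_rng(3)
n_nodes, N, D = 6, 40, 2
T = rng.normal(size=(N, D))
win = rng.integers(n_nodes, size=N)                   # winning node j(n)
coords = np.arange(n_nodes, dtype=float)              # 1-d feature map
h = np.exp(-0.5 * (coords[:, None] - coords[None, :]) ** 2)   # h_ij

# direct form (21)
z_direct = (h[:, win] @ T) / h[:, win].sum(axis=1, keepdims=True)

# kernel form: group sizes N_j, group means m_j (23), kernel K_ij (24)
Nj = np.bincount(win, minlength=n_nodes)
m = np.stack([T[win == j].mean(axis=0) if Nj[j] else np.zeros(D)
              for j in range(n_nodes)])
Kij = h * Nj / (h * Nj).sum(axis=1, keepdims=True)
z_kernel = Kij @ m                                    # (22)
```

The rows of K_ij sum to one, confirming that each updated reference vector is a convex combination of the node means.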
In the GTM algorithm, the centres y(x_i; W) of the Gaussian components can be regarded as analogous to the reference vectors z_i of the SOM. We can evaluate y(x_i; W) by solving the M-step equation (12) to find W and then using y(x_i; W) = Wφ(x_i). If we define the weighted means of the data vectors by

μ_i = Σ_n R_in t_n / Σ_n R_in    (25)

then we obtain

y(x_i; W) = Σ_j F_ij μ_j    (26)

where we have introduced the effective kernel F_ij given by

F_ij = φᵀ(x_i) (Φᵀ G Φ)⁻¹ φ(x_j) G_jj    (27)

Note that the effective kernel satisfies Σ_j F_ij = 1. To see this, we first use (27) to show that Σ_j F_ij φ_l(x_j) = φ_l(x_i). Then if one of the basis functions φ_l corresponds to a bias, so that φ_l(x) = const., the result follows.

The solution for y(x_i; W) given by (26) and (27) can be interpreted as a weighted least-squares regression (Mardia, Kent, and Bibby 1979) in which the `target' vectors are the μ_i, and the weighting coefficients are given by G_jj.

Figure 6: Example of the effective kernel F_ij plotted as a function of the node j for a given node i, for the oil flow data set after 3 iterations of EM. This kernel function is analogous to the (normalized) neighbourhood function in the SOM algorithm.

Figure 6 shows an example of the effective kernel for GTM corresponding to the oil flow problem discussed in Section 3.

From (22) and (26) we see that both GTM and SOM can be regarded as forms of kernel smoothers. However, there are two key differences. The first is that in SOM the vectors which are smoothed, defined by (23), correspond to hard assignments of data points to nodes, whereas the corresponding vectors in GTM, given by (25), involve soft assignments, weighted by the posterior probabilities. This is analogous to the distinction between K-means clustering (hard assignments) and fitting a standard Gaussian mixture model using EM (soft assignments).

The second key difference is that the kernel function in SOM is made to shrink during the course of the algorithm in an arbitrary, hand-crafted manner. In GTM the posterior probability distribution in latent space, for a given data point, forms a localised `bubble' and the radius of this bubble shrinks automatically during training, as shown in Figure 7. This responsibility bubble governs the extent to which individual data points contribute towards the vectors μ_i in (25) and hence towards the updating of the Gaussian centres y(x_i; W) via (26).
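The effective kernel (27) and the smoothing (26) can also be checked numerically. In this made-up example (random responsibilities, Gaussian basis functions plus a bias), the rows of F sum to one as claimed, and applying F to the weighted means (25) reproduces the M-step centres obtained from (12):

```python
import numpy as np

# Sketch of the effective kernel (27). With a bias basis function included,
# each row of F sums to one, as shown in the text. Phi, R and T are made-up.

rng = np.random.default_rng(4)
K, M, N, D = 12, 4, 30, 3
X = np.linspace(-1, 1, K)[:, None]
mu = np.linspace(-1, 1, M)[None, :]
Phi = np.exp(-((X - mu) ** 2) / (2 * 0.3 ** 2))
Phi = np.hstack([Phi, np.ones((K, 1))])            # bias: phi_l(x) = const.

R = rng.random((K, N))
R /= R.sum(axis=0, keepdims=True)                  # responsibilities
G = np.diag(R.sum(axis=1))                         # eq (13)

# F_ij = phi(x_i)^T (Phi^T G Phi)^(-1) phi(x_j) G_jj          (27)
F = Phi @ np.linalg.solve(Phi.T @ G @ Phi, Phi.T) @ G

# y(x_i; W) from the M-step is the kernel smoothing (26) of the
# weighted means mu_i defined in (25):
T = rng.normal(size=(N, D))
mu_vecs = (R @ T) / R.sum(axis=1, keepdims=True)   # (25)
Y = F @ mu_vecs                                    # (26)
```

This makes the kernel-smoother view of GTM concrete: Y agrees with Phi @ W_new^T where W_new^T solves the normal equations (12).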
Figure 7: Examples of the posterior probabilities (responsibilities) R_in of the latent space points at an early stage (left), intermediate stage (centre) and late stage (right) during the convergence of the GTM algorithm. These have been evaluated for a single data point from the training set in the oil-flow problem discussed in Section 3, and are plotted using a non-linear scaling of the form p(x|t_n)^{0.1} to highlight the variation over the latent space. Notice how the responsibility `bubble', which governs the updating of the weight matrix, and hence the updating of the data-space vectors y(x_i; W), shrinks automatically during the learning process.

4.2 Comparison of GTM with SOM
The most significant difference between the GTM and SOM algorithms is that GTM defines an explicit probability density given by the mixture distribution in (5). As a consequence there is a well-defined objective function given by the log likelihood (6), and convergence to a (local) maximum of the objective function is guaranteed by the use of the EM algorithm (Dempster, Laird, and Rubin 1977). This also provides a direct means to compare different choices of model parameters, and even to compare a GTM solution with another density model, by evaluating the likelihood of a test set under the generative distributions of the respective models. For the SOM algorithm, however, there is no probability density and no well-defined objective function which is being minimized by the training process. Indeed it has been proven (Erwin, Obermayer, and Schulten 1992) that such an objective function cannot exist for the SOM.

A further limitation of the SOM, highlighted in Kohonen (1995, page 234), is that the conditions under which so-called `self-organization' of the SOM occurs have not been quantified, and so in practice it is necessary to confirm empirically that the trained model does indeed have the desired spatial ordering. In contrast, the neighbourhood-preserving nature of the GTM mapping is an automatic consequence of the choice of a continuous function y(x; W).

Similarly, the smoothness properties of the SOM are determined indirectly by the choice of neighbourhood function and by the way in which it is changed during the course of the algorithm, and are therefore difficult to control. Thus, prior knowledge about the form of the map cannot easily be specified. The prior distribution for GTM, however, can be controlled directly, and properties such as smoothness are governed explicitly by basis function parameters, as illustrated in Figure 3.
Finally, we consider the relative computational costs of the GTM and SOM algorithms. For problems involving data in high-dimensional spaces the dominant computational cost of GTM arises from the evaluation of the Euclidean distances from every data point to every Gaussian centre y(x_i; W). Since exactly the same calculations must be done for SOM (involving the distances of data points from the reference vectors z_i) we expect one iteration of either algorithm to take approximately the same time. An empirical comparison of the computational cost of GTM and SOM was obtained by running each algorithm on the oil flow data until `convergence' (defined as no discernible change in the appearance of the visualization map). The GTM algorithm took 1058 sec. (40 iterations) while the batch SOM took 1011 sec. (25 iterations) using a Gaussian neighbourhood function. With a simple `top-hat' neighbourhood function, in which each reference vector is updated at each iteration using only data points associated with nearby reference vectors, the CPU time for the SOM algorithm is reduced to 305 sec. (25 iterations). One potential advantage of GTM in practical applications arises from a reduction in the number of experimental training runs needed since both convergence and topographic ordering are guaranteed.
5 Relation to Other Algorithms

There are several algorithms in the published literature which have close links with GTM. Here we review briefly the most significant of these.
The elastic net algorithm of Durbin and Willshaw (1987) can be viewed as a Gaussian mixture density model, fitted by penalized maximum likelihood. The penalty term encourages the centres of Gaussians corresponding to neighbouring points along the (typically one-dimensional) chain to be close in data space. It differs from GTM in that it does not define a continuous data space manifold. Also, the training algorithm generally involves a hand-crafted annealing of the weight penalty coefficient.

There are also similarities between GTM and principal curves and principal surfaces (Hastie and Stuetzle 1989; LeBlanc and Tibshirani 1994) which again involve a two-stage algorithm consisting of projection followed by smoothing, although these are not generative models. It is interesting to note that Hastie and Stuetzle (1989) propose reducing the spatial width of the smoothing function during learning, in a manner analogous to the shrinking of the neighbourhood function in the SOM. A modified form of the principal curves algorithm (Tibshirani 1992) introduces a generative distribution based on a mixture of Gaussians, with a well-defined likelihood function, and is trained by the EM algorithm. However, the number of Gaussian components is equal to the number of data points, and smoothing is imposed by penalizing the likelihood function with the addition of a derivative-based regularization term.

The technique of parametrized self-organizing maps (PSOMs) involves first fitting a standard SOM model to a data set and then finding a manifold in data space which interpolates the reference vectors (Ritter 1993). Although this defines a continuous manifold, the interpolating surface does not form part of the training algorithm, and the basic problems in using SOM, discussed in Section 4.2, remain.

The SOM has also been used for vector quantization. In this context it has been shown how a re-formulation of the vector quantization problem (Luttrell 1990; Buhmann and Kuhnel 1993; Luttrell 1994) can avoid many of the problems with the SOM procedure discussed earlier.

Finally, the `density network' model of MacKay (1995) involves transforming a simple distribution in latent space to a complex distribution in data space by propagation through a non-linear network. A discrete distribution in latent space is again used, which is interpreted as an approximate Monte Carlo integration over the latent variables needed to define the data space distribution. GTM can be seen as a particular instance of this framework in which the sampling of latent space is regular rather than stochastic, a specific form of non-linearity is used, and the model parameters are adapted using EM.
6 Discussion
In this paper we have introduced a form of non-linear latent variable model which can be trained efficiently using the EM algorithm. Viewed as a topographic mapping algorithm, it has the key property that it defines a probability density model.

As an example of the significance of having a probability density, consider the important practical problem of dealing with missing values in the data set (in which some components of the data vectors t_n are unobserved). If the missing values are missing at random (Little and Rubin 1987) then the likelihood function is obtained by integrating out the unobserved values. For the GTM model the integrations can be performed analytically, leading to a simple modification of the EM algorithm.

A further consequence of having a probabilistic approach is that it is straightforward to consider a mixture of GTM models. In this case the overall density can be written as

    p(t) = Σ_r P(r) p(t|r)                                              (28)

where p(t|r) represents the rth model, with its own set of independent parameters, and P(r) are mixing coefficients satisfying 0 ≤ P(r) ≤ 1 and Σ_r P(r) = 1. Again, it is straightforward to extend the EM algorithm to maximize the corresponding likelihood function.

The GTM algorithm can be extended in other ways, for instance by allowing independent mixing coefficients π_i (prior probabilities) for each of the Gaussian components, which again can be estimated by a straightforward extension of the EM algorithm. Instead of being independent parameters, the π_i can be determined as smooth functions of the latent variables using a normalized exponential applied to a generalized linear regression model, although in this case the M-step of the EM algorithm would involve non-linear optimization. Similarly, the inverse noise variance β can be generalized to a function of x. An important property of GTM is the existence of a smooth manifold in data space, which allows the local 'magnification factor' between latent and data space to be evaluated as a function of the latent space coordinates using the techniques of differential geometry (Bishop, Svensén, and Williams 1996b). Finally, since there is a well-defined likelihood function, it is straightforward in principle to introduce priors over the model parameters (as discussed in Section 2.1) and to use Bayesian techniques in place of maximum likelihood.

Throughout this paper we have focussed on the batch version of the GTM algorithm in which all of the training data are used together to update the model parameters. In some applications it will be more convenient to consider sequential adaptation in which data points are presented one at a time. Since we are minimizing a differentiable cost function, given by (6), a sequential algorithm can be obtained by appealing to the Robbins-Monro procedure (Robbins and Monro 1951; Bishop 1995) to find a zero of the objective function gradient. Alternatively, a sequential form of the EM algorithm can be used (Titterington, Smith, and Makov 1985).

A web site for GTM is provided at:

    http://www.ncrg.aston.ac.uk/GTM/

which includes postscript files of relevant papers, a software implementation in Matlab (a C implementation is under development), and example data sets used in the development of the GTM algorithm.
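The Robbins-Monro procedure cited above can be illustrated with a deliberately simple sketch (our own, not the paper's): it finds a zero of an expected gradient from one observation at a time, with step sizes a_n = a_0/n satisfying the usual conditions Σ a_n = ∞ and Σ a_n² < ∞. Here the gradient is that of a squared-error cost, so the iterates sequentially estimate a mean; a sequential GTM would apply the same schedule to the gradient of its log-likelihood instead:

```python
def robbins_monro(samples, a0=1.0):
    """Sequential stochastic root-finding: theta <- theta + a_n * g(theta, t_n),
    with a_n = a0 / n and g(theta, t) = t - theta (the negative gradient of
    the one-sample cost 0.5 * (t - theta)**2). With a0 = 1 the iterate after
    n steps is exactly the running average of the first n samples."""
    theta = 0.0
    for n, t in enumerate(samples, start=1):
        theta += (a0 / n) * (t - theta)   # one observation per update
    return theta
```

The attraction in this setting is that no pass over the full data set is needed: each data point is used once, as it arrives, and then discarded.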
Acknowledgements
This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualization of High-Dimensional Data. We would like to thank Geoffrey Hinton, Iain Strachan and Michael Tipping for useful discussions. Markus Svensén would like to thank the staff of the SANS group in Stockholm for their hospitality during part of this project.
References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A 327, 580–593.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996a). A fast EM algorithm for latent variable density models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 465–471. MIT Press.

Bishop, C. M., M. Svensén, and C. K. I. Williams (1996b). Magnification factors for the GTM algorithm. To appear in Proceedings Fifth IEE International Conference on Artificial Neural Networks.

Buhmann, J. and K. Kühnel (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory 39(4), 1133–1145.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), 1–38.

Durbin, R. and D. Willshaw (1987). An analogue approach to the travelling salesman problem. Nature 326, 689–691.

Erwin, E., K. Obermayer, and K. Schulten (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics 67, 47–55.

Hastie, T. and W. Stuetzle (1989). Principal curves. Journal of the American Statistical Association 84(406), 502–516.

Hinton, G. E., C. K. I. Williams, and M. D. Revow (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems, Volume 4, pp. 512–519. Morgan Kaufmann.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59–69.

Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.

LeBlanc, M. and R. Tibshirani (1994). Adaptive principal surfaces. Journal of the American Statistical Association 89(425), 53–64.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Transactions on Neural Networks 1(2), 229–232.

Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation 6(5), 767–794.

MacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research A 354(1), 73–80.

Mardia, K., J. Kent, and M. Bibby (1979). Multivariate Analysis. Academic Press.

Mulier, F. and V. Cherkassky (1995). Self-organization as an iterative kernel smoothing process. Neural Computation 7(6), 1165–1177.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications 9(1), 141–142.

Ritter, H. (1993). Parametrized self-organizing maps. In Proceedings ICANN'93 International Conference on Artificial Neural Networks, Amsterdam, pp. 568–575. Springer-Verlag.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.

Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing 2, 183–190.

Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. New York: John Wiley.

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26, 359–372.