
JAM: Java agents for Meta-Learning over Distributed Databases


Salvatore Stolfo, Andreas L. Prodromidis†, Shelley Tselepis, Wenke Lee, Wei Fan
Department of Computer Science, Columbia University, New York, NY 10027
(sal, andreas, sat, wenke, wfan@cs.columbia.edu)

Philip K. Chan
Computer Science, Florida Institute of Technology, Melbourne, FL 32901
(pkc@cs.fit.edu)

March 5, 1997

Abstract

In this paper, we describe the JAM system, a distributed, scalable and portable agent-based data mining system that employs a general approach to scaling data mining applications that we have come to call meta-learning. JAM provides a set of learning programs, implemented either as JAVA applets or applications, that compute models over data stored locally at a site. JAM also provides a set of meta-learning agents for combining multiple models that were learned (perhaps) at different sites. It employs a special distribution mechanism which allows the migration of the derived models or classifier agents to other remote sites. We describe the overall architecture of the JAM system and the specific implementation currently under development at Columbia University. One of JAM's target applications is fraud and intrusion detection in financial information systems. A brief description of this learning task and JAM's applicability are also described. Interested users may download JAM from http://www.cs.columbia.edu/sal/JAM/PROJECT.

This research is supported by the Intrusion Detection Program (BAA9603) of the Defense Advanced Research Projects Agency under grant F30602-96-1-0311, the Database and Expert Systems and Knowledge Models and Cognitive Systems Programs of the National Science Foundation under grant IRI-96-32225, the CISE Research Infrastructure Grant Program of the National Science Foundation under grant CDA-96-25374 and, the Center for Advanced Technology at Polytechnic University (not Columbia University) of the New York State Science and Technology Foundation under grant Polytechnic 423115-445.

† Supported by IBM


1 Introduction

One means of acquiring new knowledge from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data. The field of machine learning has made substantial progress over the years and a number of algorithms have been popularized and applied to a host of applications in diverse fields. There are numerous algorithms ranging from those based upon stochastic models, to algorithms based upon purely symbolic descriptions like rules and decision trees. Thus, we may simply apply the current generation of learning algorithms to very large databases and wait for a response! However, the question is how long might we wait? Indeed, do the current generation of machine learning algorithms scale from tasks common today that include thousands of data items to new learning tasks encompassing as much as two orders of magnitude or more of data that is physically distributed? Furthermore, many existing learning algorithms require all the data to be resident in main memory, which is clearly untenable in many realistic databases. In certain cases, data is inherently distributed and cannot be localized on any one machine (even by a trusted third party) for a variety of practical reasons including physically dispersed mobile platforms like an armada of ships, security and fault tolerant distribution of data and services, competitive (business) reasons, as well as statutory constraints imposed by law. In such situations, it may not be possible, nor feasible, to inspect all of the data at one processing site to compute one primary "global" classifier.

We call the problem of learning useful new knowledge from large inherently distributed databases the scaling problem for machine learning. We propose to solve the scaling problem by way of a technique we have come to call "meta-learning". Meta-learning seeks to compute a number of independent classifiers by applying learning programs to a collection of independent and inherently distributed databases in parallel. The "base classifiers" so computed are then integrated by another learning process. Here meta-learning seeks to compute a "meta-classifier" that integrates in some principled fashion the separately learned classifiers to boost overall predictive accuracy.

In the following pages we present a summary overview of a variety of ways of accomplishing this task by way of meta-learning as previously reported in [4]. We seek to continue developing meta-learning systems and apply these techniques to a range of large-scale distributed applications by utilizing existing agent-based infrastructures for deployment over the internet. In section 3, we detail the JAM (Java Agents for Meta-Learning) architecture, a powerful portable and extensible network agent-based system that computes meta-classifiers over distributed data. JAM is being engaged in experiments addressing real-world learning tasks such as solving key problems in fraud and intrusion detection in financial information systems. Sections 4 and 5 describe this effort. In section 6 we discuss our future research and section 7 concludes the paper.

2 Meta-Learning

We desire a unifying and scalable solution that improves the efficiency and accuracy of inductive learning when applied to large amounts of data in wide area computing networks for a range of different applications. Meta-learning is proposed as the unifying approach.

Our approach to improve efficiency is to execute a number of learning processes (each implemented as a distinct serial program) on a number of data subsets (a data reduction technique) in parallel (e.g. over a network of separate processing sites) and then to combine the collective results through meta-learning. This approach has two advantages, first it uses the same serial code at multiple sites without the time-consuming process of writing parallel programs and second, the learning process uses small subsets of data that can fit in main memory. The accuracy of the learned concepts by the separate learning process might be lower than that of the serial version applied to the entire data set since a considerable amount of information may not be accessible to each of the independent and separate learning processes. On the other hand, combining these higher level concepts via meta-learning, may achieve accuracy levels, comparable to that reached by the aforementioned serial version applied to the entire data set. Furthermore, this approach may use a variety of different learning algorithms on different computing platforms. Because of the proliferation of networks of workstations and the growing number of new learning algorithms, our approach does not rely on any specific parallel or distributed architecture, nor on any particular algorithm, and thus distributed meta-learning may accommodate new systems and algorithms relatively easily. Our meta-learning approach is intended to be scalable as well as portable and extensible.

In prior publications we introduced a number of meta-learning techniques including arbitration, combining [3] and hierarchical tree-structured meta-learning systems. Other publications have reported performance results on standard test problems and data sets with discussions of related techniques, Wolpert's stacking [9], Breiman's bagging [1] and Zhang's combining [10] to name a few. We shall not repeat this exposition in this paper. Here we describe the JAM system architecture designed to support these and perhaps other approaches to distributed data mining.
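To make the idea of integrating separately learned classifiers more concrete, the following is a minimal sketch in the spirit of a combiner strategy; it is not JAM's code. It assumes binary labels, models each base classifier as a plain function, and uses a deliberately trivial meta-learner that records, for every pattern of base predictions seen on a set of labelled examples, which true label was most frequent. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a "combiner" style meta-learner; not JAM's actual API.
class MetaLearningSketch {

    // Encode the base classifiers' predictions for one example as a key, e.g. "1,0".
    static String basePredictions(List<ToIntFunction<double[]>> base, double[] x) {
        StringBuilder key = new StringBuilder();
        for (ToIntFunction<double[]> c : base) {
            if (key.length() > 0) key.append(',');
            key.append(c.applyAsInt(x));
        }
        return key.toString();
    }

    // Meta-level training: for each pattern of base predictions, count the true labels (0 or 1).
    static Map<String, int[]> trainCombiner(List<ToIntFunction<double[]>> base,
                                            List<double[]> examples, List<Integer> labels) {
        Map<String, int[]> counts = new HashMap<>();
        for (int i = 0; i < examples.size(); i++) {
            String key = basePredictions(base, examples.get(i));
            counts.computeIfAbsent(key, k -> new int[2])[labels.get(i)]++;
        }
        return counts;
    }

    // Meta-level prediction: majority label for the observed pattern. Unseen patterns
    // fall back to the first base classifier (an arbitrary choice made for this sketch).
    static int metaPredict(List<ToIntFunction<double[]>> base,
                           Map<String, int[]> combiner, double[] x) {
        int[] c = combiner.get(basePredictions(base, x));
        if (c == null) return base.get(0).applyAsInt(x);
        return c[1] > c[0] ? 1 : 0;
    }

    public static void main(String[] args) {
        List<ToIntFunction<double[]>> base = new ArrayList<>();
        base.add(x -> x[0] > 0.5 ? 1 : 0);  // stand-in for a classifier learned at one site
        base.add(x -> x[1] > 0.5 ? 1 : 0);  // stand-in for a classifier learned at another site

        // Labelled examples whose true label is 1 only when both attributes are high.
        List<double[]> xs = Arrays.asList(new double[]{0.9, 0.9}, new double[]{0.9, 0.1},
                                          new double[]{0.1, 0.9}, new double[]{0.1, 0.1});
        List<Integer> ys = Arrays.asList(1, 0, 0, 0);

        Map<String, int[]> combiner = trainCombiner(base, xs, ys);
        System.out.println(metaPredict(base, combiner, new double[]{0.8, 0.7}));
        System.out.println(metaPredict(base, combiner, new double[]{0.8, 0.2}));
    }
}
```

Neither base classifier alone captures the target concept in this toy data, but the table learned over their joint predictions does, which is the effect the text attributes to meta-learning. A real meta-learner would be another learning algorithm rather than a lookup table.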

3 The JAM architecture

JAM is architected as an agent-based system, a distributed computing construct that is designed as an extension of OS environments. It is a distributed meta-learning system that supports the launching of learning and meta-learning agents to distributed database sites. JAM is implemented as a collection of distributed learning and classification programs linked together through a network of Datasites. Each JAM Datasite consists of:

• A local database,
• A learning agent, a machine learning program that may migrate to other sites as a JAVA applet, or be locally stored as a native application callable by a JAVA applet,
• A meta-learning agent,
• A local user configuration file,
• Graphical User Interface and Animation facilities.


The JAM Datasites have been designed to collaborate¹ with each other to exchange classifier agents that are computed by the learning agents. First, local learning agents operate on the local database and compute the Datasite's local classifiers. Each Datasite may then import (remote) classifiers from its peer Datasites and combine these with its own local classifier using the local meta-learning agent. These actions may take place at all Datasites simultaneously and independently.

The owner of a Datasite administers the local activities via the local user configuration file. Through this file, he/she can specify the required and optional local parameters to perform the learning and meta-learning tasks. Such parameters include the names of the databases to be used, the policy to partition these databases into training and testing subsets, the local learning agents to be dispatched, etc. Besides the static specification of the local parameters (before the beginning of the learning and meta-learning tasks), the owner of the Datasite can also employ JAM's graphical user interface and animation facilities to supervise agent exchanges and administer dynamically the meta-learning process. With this graphical interface, the owner may access more information such as accuracy, trends, statistics and logs and compare and analyze results in order to improve performance. For example, the owner may study results and decide to repeat the learning process or integrate new knowledge that has become available, or even discard obsolete classifiers. Or, he/she can use JAM's presentation tools to inspect directly the generated decision tree classifiers (computed, for example, by ID3) to gain valuable intuition.

Finally, once the base and meta-classifiers are computed, the JAM system manages the execution of these modules to classify and label datasets of interest.

The configuration of the distributed system is maintained by the Configuration File Manager (CFM), a central and independent module responsible for keeping the state of the system up-to-date. The CFM acts as a server that provides information about the participating Datasites and logs events for future reference and evaluation.

The logical architecture of the JAM meta-learning system is presented in Figure 1. In this example, three JAM Datasites Marmalade, Mango and Strawberry exchange their base classifiers to share their local view of the learning task. The owner of the Datasite controls the learning task by setting the parameters of the user configuration file, i.e. the algorithms to be used, the images to be used by the animation facility, the folding parameters, etc. In this example, the CFM runs on Cherry and each Datasite ends up with three base classifiers (one local plus the two remote classifiers).
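As an illustration of how such a user configuration file might be consumed, here is a small sketch. The parameter names (CFM, DATASET, LEARNER, META_LEARNER, CROSS_VALIDATION_FOLD, META_LEARNING_FOLD, META_LEARNING_LEVEL) are the ones that appear in the sample file of Figure 1. Reading the file with `java.util.Properties` is an assumption of this sketch, as are the class and method names; the paper does not describe JAM's actual parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical reader for a Datasite configuration file. The key names follow Figure 1;
// treating the file as java.util.Properties syntax is an assumption of this sketch.
class DatasiteConfigSketch {

    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));   // accepts "KEY = value" lines
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Read an integer parameter, failing loudly when a required key is absent.
    static int requiredInt(Properties p, String key) {
        String v = p.getProperty(key);
        if (v == null) throw new IllegalArgumentException("missing parameter: " + key);
        return Integer.parseInt(v.trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "CFM = Cherry.cs.columbia.edu",
            "DATASET = thyroid",
            "LEARNER = ID3",
            "META_LEARNER = Bayes",
            "CROSS_VALIDATION_FOLD = 2",
            "META_LEARNING_FOLD = 2",
            "META_LEARNING_LEVEL = 1");
        Properties cfg = parse(text);
        System.out.println(cfg.getProperty("LEARNER") + " on " + cfg.getProperty("DATASET")
            + ", folds = " + requiredInt(cfg, "CROSS_VALIDATION_FOLD"));
    }
}
```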

We have used JAVA technology to build the infrastructure of the system and developed the specific agent operators that compose and spawn new agents from existing classifier agents. JAVA technology provides the means to dispatch agents to remote sites and execute them under remote or local control. The graphical user interface, the animation facilities and most of the machine learning algorithms were also implemented in JAVA. The only parts that were imported in their native (C++) form were some of the machine learning programs; and this was done for faster prototype development and proof of concept. The JAM system builds upon the existing agent infrastructure available over the internet today. The platform-independence of JAVA technology makes it easy to port JAM and delegate its agents to any participating site. (The modules that are implemented in native C++ are not yet platform independent.)

¹ A Datasite may also operate independently without any changes.

Figure 1: The architecture of the meta-learning system. [Diagram: the JAM architecture with 3 datasites. Data Sites 1 to 3 (Marmalade.cs, Mango.cs and Strawberry.cs), each with its own Datasite database and configuration file, exchange control and data messages and transfer learning and classifier agents, so that each site holds the classifiers of Marmalade.cs, Mango.cs and Strawberry.cs. The Configuration File Manager runs on Cherry.cs.columbia.edu. The sample configuration file reads:
CFM = Cherry.cs.columbia.edu
DATASET = thyroid
LEARNER = ID3
META_LEARNER = Bayes
CROSS_VALIDATION_FOLD = 2
META_LEARNING_FOLD = 2
META_LEARNING_LEVEL = 1
IMAGE_URL = http://www.cs....]

3.1 Configuration File Manager

The CFM assumes a role equivalent to that of a name server of a network system. It is responsible for maintaining the "global" configuration of the system and making it available to the participating Datasites.

The CFM provides registration services to all Datasites that wish to become members and participate in the distributed meta-learning activity. When the CFM receives a JOIN request from a new Datasite, it verifies both the validity of the request and the identity of the Datasite. Upon success, it acknowledges the request and registers the Datasite as active. Similarly, the CFM can receive and verify the DEPARTURE request; it notes the requestor Datasite as inactive and removes it from its list of members. The CFM maintains the list of active member Datasites to establish contact and cooperation between peer Datasites. Apart from that, the CFM keeps information regarding the groups that are formed (which Datasites collaborate with which Datasites), logs the events and displays the status of the system. Through the CFM, the JAM system administrator may screen the Datasites that participate.

3.2 Datasites

Unlike the CFM, which provides a passive configuration maintenance function, the Datasites are the active components of the meta-learning system.

The Datasites are responsible for running the show. A Datasite manages its local database, builds local classifiers, obtains remote classifiers, builds local meta classifiers and interacts with a JAM user. A Datasite is implemented as a multithreaded Java program with a special GUI.

Upon initialization, a Datasite starts up the GUI through which it can accept input and display status and results. During its initialization, among its other tasks, the Datasite registers with the CFM, instantiates the local learning engine/agent² and creates a server socket for listening for connections³ from the peer Datasites.

The Datasite is a module driven by input messages or commands. After initialization is complete the Datasite waits for the next event to occur. This can be either

1. A command issued by the owner via the GUI, or
2. A message from a peer Datasite via the open socket.

In both cases, the Datasite verifies that the input is valid and can be serviced. Once this is established, the Datasite allocates a separate thread and performs the required task. This task can be any of JAM's functions: computing a local classifier, starting the meta-learning process, sending local classifiers to peer Datasites or requesting remote classifiers from them, reporting the current status, or presenting computed results.
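The verify-then-dispatch behavior just described can be sketched as follows. This is an illustrative outline only: the command names are invented, the GUI and socket sources of input are collapsed into a single `handle` call, and a thread pool stands in for whatever thread management JAM actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a message-driven Datasite; names and commands are illustrative.
class DatasiteDispatchSketch {
    private final Map<String, Runnable> tasks = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    void register(String command, Runnable task) { tasks.put(command, task); }

    // Verify that the input is valid and can be serviced, then run it on a separate thread.
    // Returns null for an unknown command so the caller can report the rejection.
    Future<?> handle(String command) {
        Runnable task = (command == null) ? null : tasks.get(command);
        if (task == null) return null;
        return pool.submit(task);
    }

    // Convenience for callers that want to block until the task has finished.
    boolean handleAndWait(String command) {
        Future<?> f = handle(command);
        if (f == null) return false;
        try {
            f.get();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        DatasiteDispatchSketch site = new DatasiteDispatchSketch();
        site.register("BUILD_LOCAL_CLASSIFIER", () -> System.out.println("computing local classifier"));
        site.register("REPORT_STATUS", () -> System.out.println("status: idle"));

        System.out.println(site.handleAndWait("BUILD_LOCAL_CLASSIFIER")); // serviced on a worker thread
        System.out.println(site.handleAndWait("UNKNOWN"));                // rejected input
        site.shutdown();
    }
}
```

Running each accepted request on its own worker keeps the Datasite responsive to further GUI commands and peer messages while a long learning task is in progress.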

Figure 2 presents a snapshot of the JAM system during the meta-learning phase. In this example three Datasites, Marmalade, Strawberry and Mango (see the group panel of the figure) collaborate in order to share and improve their knowledge in diagnosing hypothyroidism. The snapshot taken is from "Marmalade's point of view". Initially, Marmalade consults the Datasite configuration file where the owner of the Datasite sets the parameters. In this case, the dataset is a medical database with records, noted by thyroid in the Data Set panel. Other parameters include the host of the CFM, the Cross-Validation Fold, the Meta-Learning Fold, the Meta-Learning Level, the names of the local learning agent and the local meta-learning agent, etc. Refer to [2] for more information on the meaning and use of these parameters. (Notice that Marmalade has established that Strawberry and Mango are its peer Datasites, having acquired this information from the CFM.)

Then, Marmalade partitions the thyroid database (noted as thyroid.1.bld and thyroid.2.bld in the Data Set panel) for the 2-Cross-Validation Fold, and afterwards computes the local classifier, noted by Marmalade.1 (here by calling the ID3 learning agent), viewed at the main panel. Next, Marmalade imports the remote classifiers, noted by Strawberry.1 and Mango.1, and begins the meta-learning process. The snapshot of Figure 2 displays the system at this stage. In the animated meta-learning process JAM's GUI moves icons within the panel displaying the construction of a new meta-classifier. Marmalade will use this meta-classifier in the future to predict the classes of input data items (in this case unlabelled medical records).

The owner of a JAM Datasite has direct control over the stages and the progress of the learning and meta learning process. He/She can observe the internals of the generated classifiers and meta classifiers and get reports on the results and statistics.

² The Datasite consults the local Datasite configuration file (maintained by the owner of the Datasite) to obtain information regarding the central CFM and the types of the available machine learning agents.

³ For each connection, the Datasite spawns a separate thread.



Figure 2: Two different snapshots of the JAM system in action. Left: Marmalade is building the meta classifier (meta learning stage). Right: An ID3 tree-structured classifier is being displayed in the Classifier Visualization Panel.

3.3 Classifier Visualization

JAM provides graph drawing tools to help users understand the learned knowledge [7]. There are many kinds of classifiers, e.g., a decision tree by ID3, that can be represented as graphs. In JAM we have employed major components of JavaDot [8], an extensible visualization system, to display the classifier and allow the user to analyze the graph. Since each machine learning algorithm has its own format to represent the data classifier, JAM uses an algorithm-specific translator to read the classifier and generate a JavaDot graph representation.

Figure 2 shows the JAM classifier visualization panel with a decision tree, where the leaf nodes represent classes (decisions), the non-leaf nodes represent the attributes under test, and the edges represent the attribute values. The user can select the "Attributes" command from the "Object" pull-down menu to see any additional information about a node or an edge. In the figure, the "Attributes" window shows the classifying information of the highlighted leaf node⁴. It is difficult to view clearly a very large graph (that has a large number of nodes and edges) due to the limited window size. The classifier visualization panel provides commands for the user to traverse and analyze parts of the graph: the user can select a node and use the "Top" command from the "Graph" menu to make the subgraph starting from the selected node be the entire graph in display; use the "Parent" command to view the enclosing graph; and use the "Root" command to see the entire original graph.

⁴ Thus visually, we see that for a test data item, if its "p-2" value is 3 and its "p-14" value is 2, then it belongs to class "0" with .889 probability.

Some machine learning algorithms generate concise and very readable textual outputs, e.g., the rule sets from Ripper [6]. It is thus counter-intuitive to translate the text to graph form for display purposes. In such cases, JAM simply pretty formats the text output and displays it in the classifier visualization panel.
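The translator idea for tree-structured classifiers can be illustrated with a short sketch that walks a decision tree and writes a graph description. The output here is Graphviz DOT text, used only as a familiar stand-in: the input format that JavaDot actually expects is not described in this paper, and the `Node` type and method names are hypothetical. The example tree reuses the attribute names from footnote 4, with an assumed test order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical translator from a decision tree to Graphviz DOT text (not JavaDot's real API).
class TreeToDotSketch {

    static class Node {
        final String label;                                        // attribute name, or class for a leaf
        final Map<String, Node> children = new LinkedHashMap<>();  // edge label = attribute value
        Node(String label) { this.label = label; }
        Node add(String value, Node child) { children.put(value, child); return this; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Labels are written verbatim; escaping of quotes is omitted to keep the sketch short.
    static String toDot(Node root) {
        StringBuilder out = new StringBuilder("digraph classifier {\n");
        emit(root, new int[]{0}, out);
        return out.append("}\n").toString();
    }

    // Depth-first walk; returns the numeric id assigned to the node just written.
    private static int emit(Node n, int[] nextId, StringBuilder out) {
        int id = nextId[0]++;
        out.append("  n").append(id).append(" [label=\"").append(n.label).append("\"")
           .append(n.isLeaf() ? ", shape=box" : "").append("];\n");
        for (Map.Entry<String, Node> e : n.children.entrySet()) {
            int child = emit(e.getValue(), nextId, out);
            out.append("  n").append(id).append(" -> n").append(child)
               .append(" [label=\"").append(e.getKey()).append("\"];\n");
        }
        return id;
    }

    public static void main(String[] args) {
        // Non-leaf nodes are attributes under test, edges are attribute values, leaves are classes.
        Node tree = new Node("p-2")
            .add("3", new Node("p-14").add("2", new Node("class 0")));
        System.out.print(toDot(tree));
    }
}
```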



3.4 Animation

For demonstration and didactic purposes, the meta-learning component of the JAM graphical user interface contains a collection of animation panels which visually illustrate the stages of meta-learning in parallel with execution. When animation is enabled, a transition into a new stage of computation or analysis triggers the start of the animation sequence corresponding to the underlying activity. The animation loops continuously until the given activity ceases.

The JAM program gives the user the option of manually initiating each distinct meta-learning stage (by clicking a "Next" button), or sending the process into automatic execution (by clicking a "Continue" button). The manual run option provides a temporary program halt. For "hands free" operation of JAM, the user can start the program with animation disabled and execution set to automatic transition to the next stage in the process.

3.5 Agents

JAM's extensible plug-and-play architecture allows snap-in learning agents. The learning and meta-learning agents are designed as objects. JAM provides the definition of the parent agent class, and every instance agent (i.e. a program that implements any of your favorite learning algorithms ID3, CART, BAYES, WPEBLS, etc.) is then defined as a subclass of this parent class. Among other definitions which are inherited by all agent subclasses, the parent agent class provides a very simple and minimal interface that all subclasses have to comply with. As long as a learning or meta-learning agent conforms to this interface, it can be introduced and used immediately in the JAM system even during execution.

To be more specific, a JAM agent needs to have the following methods implemented for JAM to use it effectively:

1. A constructor method with no arguments. JAM can then instantiate the agent, provided it knows its name (which can be supplied by the owner of the Datasite through either the local user configuration file or the GUI).
2. An initialize() method. In most of the cases, if not all, the agent subclasses inherit this method from the parent agent class. Through this method, JAM can supply the necessary arguments to the agent. Arguments include the names of the training and test datasets, the name of the dictionary file, and the filename of the output classifier.
3. A buildClassifier() method. JAM calls this method to trigger the agent to learn (or meta-learn) from the training dataset.
4. The getClassifier() and getCopyOfClassifier() methods. These methods are used by JAM to obtain the newly built classifiers. These are then encapsulated and can be "snapped in" at any other participating Datasite! Hence, remote agent dispatch is easily accomplished.

The class hierarchy (only methods are shown) for four different learning agents is presented in Figure 3. ID3, Bayes, Wpebls and Ripper inherit the methods initialize() and getClassifier() from their parent learning agent class. The Meta-Learning, Classifier and Meta-Classifier classes are defined in similar hierarchies.

Figure 3: The class hierarchy of learning agents. [Diagram: the parent class Learner declares Learner(), boolean initialize(String dbName, ...), boolean BuildClassifier(), Classifier getCopyOfClassifier() and Classifier getClassifier() { return classifier; }. The subclasses ID3Learner (Decision Tree), BayesLearner (Probabilistic), WpeblsLearner (Nearest Neighbor) and RipperLearner (Rule-Based) each declare their own constructor, boolean BuildClassifier() and Classifier getCopyOfClassifier().]

JAM is designed and implemented independently of the machine learning programs of interest. As long as a machine learning program is defined and encapsulated as an object conforming to the minimal interface requirements (most existing algorithms have similar interfaces already) it can be imported and used directly. This plug-and-play characteristic makes JAM a truly powerful and extensible data mining facility.
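The four requirements listed above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the JAM source: `Object` stands in for JAM's Classifier type, `initialize` takes only one of the arguments mentioned in the text, getCopyOfClassifier() is omitted, and `MajorityLearner` and `AgentLoaderSketch` are invented names. The loader shows why a no-argument constructor matters: an agent can be instantiated from nothing more than its class name.

```java
// Hypothetical sketch of the learner interface described above; the real JAM classes
// may differ in names, signatures and return types.
abstract class Learner {
    protected String trainingSet;
    protected Object classifier;     // stand-in for JAM's Classifier type

    // Inherited by subclasses; only one of the arguments listed in the text is modeled here.
    public boolean initialize(String trainingSet) {
        this.trainingSet = trainingSet;
        return trainingSet != null;
    }

    // Each learning algorithm supplies its own training procedure.
    public abstract boolean buildClassifier();

    public Object getClassifier() { return classifier; }
}

// A trivial snap-in agent, invented for this example.
class MajorityLearner extends Learner {
    public MajorityLearner() { }     // no-argument constructor, as required

    @Override
    public boolean buildClassifier() {
        classifier = "always predict the majority class of " + trainingSet;
        return true;
    }
}

class AgentLoaderSketch {
    // Instantiate an agent from its class name via the no-argument constructor.
    static Learner load(String className) {
        try {
            return (Learner) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load agent " + className, e);
        }
    }

    public static void main(String[] args) {
        Learner agent = load(MajorityLearner.class.getName());
        agent.initialize("thyroid.1.bld");
        agent.buildClassifier();
        System.out.println(agent.getClassifier());
    }
}
```

Because the loader depends only on the parent class, a new algorithm can be added by writing one subclass and naming it in the configuration, which is the plug-and-play property the section describes.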

4 Fraud and Intrusion Detection

A secured and trusted interbanking network for electronic commerce requires high speed verification and authentication mechanisms that allow legitimate users easy access to conduct their business, while thwarting fraudulent transaction attempts by others. Fraudulent electronic transactions are a significant problem, one that will grow in importance as the number of access points in the nation's financial information system grows.

Financial institutions today typically develop custom fraud detection systems targeted to their own asset bases. Recently though, banks have come to realize that a unified, global approach is required, involving the periodic sharing with each other of information about attacks.

We have proposed another wall to protect the nation's financial systems from threats. This new wall of protection consists of pattern-directed inference systems using models of anomalous or errant transaction behaviors to forewarn of impending threats. This approach requires analysis of large and inherently distributed databases of information about transaction behaviors to produce models of "probably fraudulent" transactions. We use JAM to compute these models.

The key difficulties in this approach are: financial companies don't share their data for a number of (competitive and legal) reasons; the databases that companies maintain on transaction behavior are huge and growing rapidly; real-time analysis is highly desirable to update models when new events are detected and easy distribution of models in a networked environment is essential to maintain up to date detection capability.


JAM is used to compute local fraud detection agents that learn how to detect fraud and provide intrusion detection services within a single corporate information system, and an integrated meta-learning system that combines the collective knowledge acquired by individual local agents. Once derived local classifier agents or models are produced at some Datasite(s), two or more such agents may be composed into a new classifier agent by JAM's meta-learning agents.

JAM allows financial institutions to share their models of fraudulent transactions by exchanging classifier agents in a secured agent infrastructure. But they will not need to disclose their proprietary data. In this way their competitive and legal restrictions can be met, but they can still share information. The meta-learned system can be globally constructed, or alternatively it can be local. In the latter guise, each corporate entity benefits from the collective knowledge by using its privately available data to locally learn a meta-classifier agent from the shared models. The meta-classifiers then act as sentries forewarning of possibly fraudulent transactions and threats by inspecting, classifying and labeling each incoming transaction.

4.1 How Local Detection Components Work

Consider the generic problem of detecting fraudulent transactions, in which we are not concerned with global coordinated attacks. We posit there are two good candidate approaches.

Approach 1:

i) Each bank $i$, 1 [...] their databases, $DB_i$, to produce a classifier $f_i$. In the simplest version, all the $f_i$ have the same input feature space. Note that each $f_i$ is just a mapping from the feature space of transaction, $x$, to a bimodal fraud label.


(This is exactly as in step (iii) of approach 1, except the data set used for combining is $T_i$ (a different one for each bank) rather than $DB$.) Each such mapping is $F_i$.

iv) Each bank uses its $F_i$ as in step (iv) of approach 1.

This latter approach is depicted in figure 4. Note that now there is no issue of where $DB$ comes from, or how it might be formed. However, there is a different issue - would one really expect that $F_i$ is all that much better in predictive accuracy than a classifier simply trained on the entire set of available data, $DB_i \cup T_i$? After all, the $F_i$ are created by looking solely at bank $i$'s local data; the $f_j$, $j \neq i$, are in essence just new features bank $i$ can use to look at its data. Formal studies are underway to answer this question. Some preliminary results are described next.

Figure 4: Sharing knowledge without sharing data. [Diagram labels: Local Classifier; Remote Classifier 1, Remote Classifier 2, ..., Remote Classifier n; Meta-level Training Data; Local Metaclassifier.]

5 Credit Card Fraud Transaction Data

Here we provide a general view of the data schema for the labelled transaction data sets compiled by a bank and used by our system. For purposes of our research and development activity, several data sets are being acquired from several banks, each providing 0.5 million records spanning one year, sampled at an average of 42,000 per month, from November 1995 to October 1996.

The schema of the database was developed over years of experience and continuous analysis by bank personnel to capture important information for fraud detection. The general schema of this data is provided in such a way that important confidential and proprietary information is not disclosed here. (After all we seek not to teach "wannabe thieves" important lessons on how to hone their skills.) The records have a fixed length of 137 bytes each and about 30 numeric attributes including the binary classification (fraud/legitimate transaction). Some of the fields are arithmetic and the rest categorical, i.e. numbers were used to represent a few discrete categories. The information in each record includes:

• A (non-revealing) hashed credit card account number
• Scores produced by a commercial authorization/detection system
• The date and time of each transaction


• Past payment information of the transactor
• The amount of the transaction
• Geographic information, that is, information regarding the locations where the transaction was initiated and the location of the merchant and transactor
• Codes for the validity and the manner of entry of the transaction
• An industry standard code for the type of merchant
• A code for other recent "non-monetary" transaction types performed by the transactor
• The age of the account and the card
• Other credit card account information
• Confidential and Proprietary Fields (which potentially carry other indicators)
• The fraud label (the transaction was either fraudulent or legitimate)

It is interesting to note that most of this information is indeed captured by each bank. However, each bank also includes specific fields containing important information that it has determined separately and that provides predictive value in determining fraudulent transaction patterns. The integration of this information across separately learned classifiers at each bank site is a non-trivial problem, called the data schema integration problem.

We describe two approaches for handling these "proprietary fields" (PF). For purposes of this discussion, we distinguish two separate data sets from two banks called Bank A and Bank B:

Method I: Learn a local model using PF fields and exchange. Let's assume that the fraud detection model learned in Bank A includes the PF fields. Let's also assume that the data at the Bank B site do not include these fields and hence the classifier cannot deal with them. Then, Bank B will not be able to use Bank A's model directly unless:

1. Bank B "massages" its data to include PF values. To do this, Bank B must import, along with the remote classifier, a secure/trusted agent from Bank A that can compute values for the missing PFs in Bank B's data. This however may not be possible in all cases, and may not be desirable to Bank A.
2. Bank B simply includes null values in a "bogus" PF field added to the Bank B data set. Even though the PF fields may have high predictive value for Bank A, they are of no value for Bank B. After all, Bank B did not include them in its authorization system and presumably other attributes (including the common ones) do have predictive value.
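The second option under Method I amounts to projecting each local record onto the field layout that the imported classifier expects and leaving the missing proprietary fields empty. A minimal sketch follows; the field names are invented for illustration (the real schema is confidential), and representing a record as a map from field name to numeric value is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of option 2: pad records with nulls for fields the local bank lacks.
class SchemaPaddingSketch {

    // Project a local record onto the field order expected by an imported classifier.
    // Fields absent from the record (e.g. another bank's proprietary fields) become null.
    static Double[] align(Map<String, Double> record, List<String> expectedFields) {
        Double[] row = new Double[expectedFields.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = record.get(expectedFields.get(i));   // null when the field is missing
        }
        return row;
    }

    public static void main(String[] args) {
        // Field names are invented; "pfA1" plays the role of a proprietary field of Bank A.
        List<String> bankASchema = Arrays.asList("amount", "merchantCode", "pfA1");
        Map<String, Double> bankBRecord = new HashMap<>();
        bankBRecord.put("amount", 125.0);
        bankBRecord.put("merchantCode", 5411.0);

        System.out.println(Arrays.toString(align(bankBRecord, bankASchema)));
    }
}
```

Whether a given classifier tolerates null inputs depends on the learning algorithm, which is one reason the text treats this option as a workaround rather than a full solution.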


Method II: Learn a model using PF fields and hold locally. Again, we assume that the data at Bank A include some additional PF fields that Bank B's data lack. In this approach, Bank A can learn two local models. One, with the PF fields, is stored locally for later use by the meta-learning agents, while the second one, without these fields, is exchanged.

Learning a second classifier without the PF fields, or better yet, with fields that belong to the intersection of the fields of the data sets of the two banks, implies that the second classifier makes use only of the attributes that are common among the participating sites, and no issue exists for its integration at other banks. On the other hand, remote classifiers imported by Bank A (and assured not to involve predictions over the PF fields) can still be locally integrated with the original model that employs the PF fields. In this case, the remote classifiers simply ignore the PF fields of the local data set.
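The field intersection used in Method II can be illustrated with a short Java sketch. The class and method names (`CommonFieldProjection`, `commonFields`, `project`) and the map-based record representation are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: restricting records to the fields shared by two banks. */
class CommonFieldProjection {

    /** Fields present in both schemas, in Bank A's order. */
    static Set<String> commonFields(Set<String> bankA, Set<String> bankB) {
        Set<String> common = new LinkedHashSet<>(bankA);
        common.retainAll(bankB);   // set intersection
        return common;
    }

    /** Copy of the record that keeps only the given fields (PF fields are dropped). */
    static Map<String, Object> project(Map<String, Object> record, Set<String> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String f : fields) {
            if (record.containsKey(f)) {
                out.put(f, record.get(f));
            }
        }
        return out;
    }
}
```

A classifier trained on projected records depends only on common attributes, so it can be exchanged; the model trained on the full records stays at Bank A.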

Both approaches address the data schema integration problem, and meta-learning over these models should proceed in a straightforward manner.

5.1 Description of the learning process

In this section, we describe the setting of our experiments. In particular, we split the original data set provided by one bank into random partitions and distributed them across the different sites of the JAM network. Then we computed the accuracy of each model obtained at each such partition.

To be more specific, we sampled 84,000 records from the total of 500,000 records of the data set we used in our experiments, and kept them for the Validation and Test sets to evaluate the accuracy of the resultant distributed models. The learning task is to identify patterns in the 30 attribute fields that can characterize the fraudulent class label.

Let's assume, without loss of generality, that we apply the ID3 learning process to two sites of data (say sites 1 and 2), while two instances of Ripper are applied elsewhere (say at sites 3 and 4), all being initiated as Java agents. The result of these four local computations is four separate classifiers, $C_{ID3_i}()$, $i = 1, 2$, and $C_{Ripper_j}()$, $j = 3, 4$, that are each invocable as agents at arbitrary sites of credit card transaction data.
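The data preparation described above (hold out a sample, then spread the remaining records over the sites) can be sketched as follows. This is an assumption-laden illustration: the text does not say how the partitions were drawn beyond "random", so a seeded shuffle followed by round-robin dealing is used here, and `SitePartitioner`, `Split` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: hold out records for validation/test, partition the rest across sites. */
class SitePartitioner {

    /** Held-out records plus one training partition per site. */
    static final class Split<T> {
        final List<T> heldOut;
        final List<List<T>> sitePartitions;

        Split(List<T> heldOut, List<List<T>> sitePartitions) {
            this.heldOut = heldOut;
            this.sitePartitions = sitePartitions;
        }
    }

    static <T> Split<T> split(List<T> records, int heldOutSize, int sites, long seed) {
        if (heldOutSize < 0 || heldOutSize > records.size() || sites <= 0) {
            throw new IllegalArgumentException("invalid held-out size or site count");
        }
        List<T> shuffled = new ArrayList<T>(records);
        Collections.shuffle(shuffled, new Random(seed));      // seeded, so the split is repeatable

        List<T> heldOut = new ArrayList<T>(shuffled.subList(0, heldOutSize));
        List<List<T>> parts = new ArrayList<List<T>>();
        for (int s = 0; s < sites; s++) {
            parts.add(new ArrayList<T>());
        }
        // Deal the remaining records round-robin so partition sizes differ by at most one.
        for (int i = heldOutSize; i < shuffled.size(); i++) {
            parts.get((i - heldOutSize) % sites).add(shuffled.get(i));
        }
        return new Split<T>(heldOut, parts);
    }
}
```

With 500,000 records, a call such as `split(records, 84000, sites, seed)` would mirror the proportions reported in the text; the number of sites is left as a parameter.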

A sample Ripper rule-based classifier learned from the credit card data set is depicted in Figure 5, a relatively small set of rules that is easily communicated among distributed sites as needed (see footnote 5). To extract fraud data from a distinct fifth site of data, or any other site, using, say, $C_{Ripper_3}()$, the code implementing this classifier would be transmitted to the fifth site and invoked remotely to extract data. This can be accomplished, for example, using a query of the form:

Select X.* From Credit-card-data Where $C_{Ripper_3}$(X.fraudlabel) = 1.

Naturally, the select expression rendered here in SQL in this example can be instead implemented directly as a data filter applied against incoming transactions at a server site in a fraud detection system.

The end result of this query is a stream of data accessed from some remote source based entirely upon the classifications learned at site 3. Notice that requesting transactions classified as "not fraud" would result in no information being returned at all (rather than [...]

Footnote 5: The specific confidential attribute names are not revealed here.
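The data-filter view of the query can be sketched in Java as a predicate applied to a list of transactions. Everything named here is hypothetical: the attribute names `a` and `b` and the thresholds passed to `atLeast` are placeholders (the real attribute names are confidential), and `FraudFilter`, `anyRule` and `select` are illustrative names rather than JAM's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: a rule-based classifier used as a filter over transactions. */
class FraudFilter {

    /** One rule condition: the attribute is present and at least the given minimum. */
    static Predicate<Map<String, Double>> atLeast(final String attr, final double min) {
        return tx -> {
            Double v = tx.get(attr);
            return v != null && v >= min;
        };
    }

    /** Stand-in for a rule set: a transaction is flagged if any rule fires. */
    static Predicate<Map<String, Double>> anyRule(final List<Predicate<Map<String, Double>>> rules) {
        return tx -> {
            for (Predicate<Map<String, Double>> rule : rules) {
                if (rule.test(tx)) {
                    return true;
                }
            }
            return false;
        };
    }

    /** Counterpart of "Select X.* ... Where C(X) = 1": keep only flagged transactions. */
    static List<Map<String, Double>> select(List<Map<String, Double>> transactions,
                                            Predicate<Map<String, Double>> classifier) {
        List<Map<String, Double>> out = new ArrayList<Map<String, Double>>();
        for (Map<String, Double> tx : transactions) {
            if (classifier.test(tx)) {
                out.add(tx);
            }
        }
        return out;
    }
}
```

Because the classifier is an ordinary predicate object, the same `select` call works whether the classifier was learned locally or imported from another site.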



Figure 5: Left: This sample rule-based model covers 1365 non-fraudulent and 290 fraudulent credit card transactions. Right: Sample portion of the ID3 decision tree meta-classifier learned from the predictions of the four base classifiers. In this portion, only the classifiers [...]

[Figure 5, partially recovered. Left panel: rules of the form "Fraud :- ..." with conditions including a >= 148, a >= 774 and b >= 695. Right panel: decision tree with nodes site1 and site2 and leaf labels 1 and 0.]


Bayes exhibited 80% TP and 13% FP in one setting and 80% TP and 19 [...] the three base classifiers with the least correlated error and in the second it combined the four most accurate base classifiers. The experiments, settings, rationale and results have been reported in detail in a companion paper also available from http://www.cs.columbia.edu/sal/JAM/PROJECT.

6 Future Research

The JAM prototype is in an ongoing state of development. A number of issues require study. Their resolution will produce a number of enhancements, including:

An intelligent and efficient pairing process. The data sites will be able to discern which classifiers contribute positively and then choose to cooperate more closely with the data sites they came from. The rationale behind this is that the more compatible, representative and diverse a database is, the better the classifiers the data site will produce (assuming an equal quality in the type of learning algorithms).

A mechanism for integrating new knowledge. More and more data will become available, new trends will emerge and new ways to bypass the detection mechanisms will be devised. We will explore ways to incorporate the new knowledge without rendering the present and old data obsolete. In fact, new knowledge can be treated in a fashion similar to the knowledge imported from remote sources. Meta-learning techniques are already in use to combine knowledge over space (remote databases); in the near future they can also be employed to combine knowledge acquired over time. This way, JAM will become a lifelong system constantly accumulating useful knowledge.

Multilevel meta-learning. We will generalize the one-level meta-learning approach to a multilevel one. As more and more classifiers become available we will need to organize them efficiently. This is again a hierarchical approach related to the scalability issue addressed before, but from a different perspective. Data sites will not only exchange base classifiers but entire "meta classifier trees". A meta classifier tree is a tree in which each internal node is a meta classifier and each leaf is a base classifier. (A meta classifier carries with it all its children meta classifiers through which it was built; this happens recursively all the way down to the leaves of the tree.) In fact, one of the key capabilities provided by our proposed system is that it is very easy to import, invoke and handle agents, or even to realize agent composition operators. It is possible, for example, to implement $C_{ID3_1}()$ (see section 4) as a sub-agent wholly contained within the meta-classifier M. In this case, each sub-agent is defined as a simple Java object callable from within M, and when the meta-classifier agent M is transmitted to a remote site, the sub-agent "travels" with it.

A scalable decision making protocol. As the number of participating Datasites and available databases increases, communication, coordination and efficiency become problematic. We will augment each Datasite of the system with a hierarchical and distributed protocol to facilitate scalability and eliminate bottlenecks.

A distributed configuration file manager. In this first version JAM uses a centralized approach to maintain the global configuration. This is an obvious sequential bottleneck. We will replace the current configuration file manager process with several logically distributed processes that will interact with each other in order to realize scalability, maintain the global configuration, and support fault tolerance.
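The meta classifier tree described under multilevel meta-learning above can be sketched as a composite of Java objects. This is an illustration under several assumptions: majority voting stands in for a learned meta-classifier, standard Java object serialization stands in for whatever transport JAM uses to move agents, and all type names (`TreeClassifier`, `ThresholdLeaf`, `VotingMetaNode`, `AgentShipping`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** A node of a meta classifier tree; Serializable so that a whole tree can be shipped. */
interface TreeClassifier extends Serializable {
    int predict(double[] x);   // 1 = fraud, 0 = legitimate
    int baseCount();           // number of base classifiers (leaves) under this node
}

/** Hypothetical base classifier: flags a record when one attribute reaches a threshold. */
class ThresholdLeaf implements TreeClassifier {
    private final int index;
    private final double min;

    ThresholdLeaf(int index, double min) { this.index = index; this.min = min; }

    public int predict(double[] x) { return x[index] >= min ? 1 : 0; }
    public int baseCount() { return 1; }
}

/** Internal node holding its children as plain objects; majority vote is an assumed combiner. */
class VotingMetaNode implements TreeClassifier {
    private final ArrayList<TreeClassifier> children;

    VotingMetaNode(List<? extends TreeClassifier> children) {
        this.children = new ArrayList<TreeClassifier>(children);
    }

    public int predict(double[] x) {
        int votes = 0;
        for (TreeClassifier c : children) votes += c.predict(x);   // recursion down the tree
        return 2 * votes > children.size() ? 1 : 0;
    }

    public int baseCount() {
        int n = 0;
        for (TreeClassifier c : children) n += c.baseCount();
        return n;
    }
}

/** Simulates transmitting an agent: serialize the root and read it back as a new object. */
class AgentShipping {
    static TreeClassifier roundTrip(TreeClassifier root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);   // children are fields, so they are written too
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TreeClassifier) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

Because each child is held as a field of its parent, sending the root sends the entire tree, which is the sense in which a sub-agent "travels" with its meta-classifier.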

7 Conclusions

We believe the concepts embodied by the term meta-learning provide an important step in developing systems that learn from massive databases and that scale. A deployed and secured meta-learning system will provide the means of using large numbers of low-cost networked computers that collectively learn useful and new knowledge from massive databases, knowledge that would otherwise be prohibitively expensive to obtain. We believe meta-learning systems deployed as intelligent agents will be an important contributing technology for deploying intrusion detection facilities in global-scale, integrated information systems.

In this paper we described the JAM architecture, a distributed, scalable, extensible and portable agent-based system that supports the launching of learning and meta-learning agents to distributed database sites. JAM can integrate distributed knowledge and boost the overall predictive accuracy of a number of independently learned classifiers through meta-learning agents. We have engaged JAM in a real, practical and important problem. In collaboration with the FSTC we have populated these database sites with records of credit card transactions, provided by different banks, in an attempt to detect and prevent fraud by combining learned patterns and behaviors from independent sources.



