JAM: Java agents for Meta-Learning over Distributed Databases
Salvatore Stolfo, Andreas L. Prodromidis†, Shelley Tselepis, Wenke Lee, Wei Fan<br />
Department of Computer Science, Columbia University, New York, NY 10027<br />
(sal, andreas, sat, wenke, wfan@cs.columbia.edu)<br />
Philip K. Chan<br />
Computer Science, Florida Institute of Technology, Melbourne, FL 32901<br />
(pkc@cs.fit.edu)<br />
March 5, 1997<br />
Abstract<br />
In this paper, we describe the JAM system, a distributed, scalable and portable agent-based data mining system that employs a general approach to scaling data mining applications that we have come to call meta-learning. JAM provides a set of learning programs, implemented either as JAVA applets or applications, that compute models over data stored locally at a site. JAM also provides a set of meta-learning agents for combining multiple models that were learned (perhaps) at different sites. It employs a special distribution mechanism which allows the migration of the derived models or classifier agents to other remote sites. We describe the overall architecture of the JAM system and the specific implementation currently under development at Columbia University. One of JAM's target applications is fraud and intrusion detection in financial information systems. A brief description of this learning task and JAM's applicability are also described. Interested users may download JAM from http://www.cs.columbia.edu/sal/JAM/PROJECT.<br />
This research is supported by the Intrusion Detection Program (BAA9603) of the Defense Advanced Research Projects Agency under grant F30602-96-1-0311, the Database and Expert Systems and Knowledge Models and Cognitive Systems Programs of the National Science Foundation under grant IRI-96-32225, the CISE Research Infrastructure Grant Program of the National Science Foundation under grant CDA-96-25374, and the Center for Advanced Technology at Polytechnic University (not Columbia University) of the New York State Science and Technology Foundation under grant Polytechnic 423115-445.<br />
† Supported by IBM
1 Introduction<br />
One means of acquiring new knowledge from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data. The field of machine learning has made substantial progress over the years and a number of algorithms have been popularized and applied to a host of applications in diverse fields. There are numerous algorithms ranging from those based upon stochastic models, to algorithms based upon purely symbolic descriptions like rules and decision trees. Thus, we may simply apply the current generation of learning algorithms to very large databases and wait for a response! However, the question is how long might we wait? Indeed, do the current generation of machine learning algorithms scale from tasks common today that include thousands of data items to new learning tasks encompassing as much as two orders of magnitude or more of data that is physically distributed? Furthermore, many existing learning algorithms require all the data to be resident in main memory, which is clearly untenable in many realistic databases. In certain cases, data is inherently distributed and cannot be localized on any one machine (even by a trusted third party) for a variety of practical reasons including physically dispersed mobile platforms like an armada of ships, security and fault tolerant distribution of data and services, competitive (business) reasons, as well as statutory constraints imposed by law. In such situations, it may not be possible, nor feasible, to inspect all of the data at one processing site to compute one primary "global" classifier. We call the problem of learning useful new knowledge from large inherently distributed databases the scaling problem for machine learning. We propose to solve the scaling problem by way of a technique we have come to call "meta-learning". Meta-learning seeks to compute a number of independent classifiers by applying learning programs to a collection of independent and inherently distributed databases in parallel. The "base classifiers" so computed are then integrated by another learning process. Here meta-learning seeks to compute a "meta-classifier" that integrates in some principled fashion the separately learned classifiers to boost overall predictive accuracy.<br />
In the following pages we present a summary overview of a variety of ways of accomplishing this task by way of meta-learning as previously reported in [4]. We seek to continue developing meta-learning systems and apply these techniques to a range of large-scale distributed applications by utilizing existing agent-based infrastructures for deployment over the internet. In section 3, we detail the JAM (Java Agents for Meta-Learning) architecture, a powerful portable and extensible network agent-based system that computes meta-classifiers over distributed data. JAM is being engaged in experiments addressing real-world learning tasks such as solving key problems in fraud and intrusion detection in financial information systems. Sections 4 and 5 describe this effort. In section 6 we discuss our future research and section 7 concludes the paper.<br />
2 Meta-Learning<br />
We desire a unifying and scalable solution that improves the efficiency and accuracy of inductive learning when applied to large amounts of data in wide area computing networks for a range of different applications. Meta-learning is proposed as the unifying approach.<br />
Our approach to improve efficiency is to execute a number of learning processes (each implemented as a distinct serial program) on a number of data subsets (a data reduction technique) in parallel (e.g. over a network of separate processing sites) and then to combine the collective results through meta-learning. This approach has two advantages: first, it uses the same serial code at multiple sites without the time-consuming process of writing parallel programs and second, the learning process uses small subsets of data that can fit in main memory. The accuracy of the learned concepts by the separate learning process might be lower than that of the serial version applied to the entire data set since a considerable amount of information may not be accessible to each of the independent and separate learning processes. On the other hand, combining these higher level concepts via meta-learning may achieve accuracy levels comparable to that reached by the aforementioned serial version applied to the entire data set. Furthermore, this approach may use a variety of different learning algorithms on different computing platforms. Because of the proliferation of networks of workstations and the growing number of new learning algorithms, our approach does not rely on any specific parallel or distributed architecture, nor on any particular algorithm, and thus distributed meta-learning may accommodate new systems and algorithms relatively easily. Our meta-learning approach is intended to be scalable as well as portable and extensible.<br />
In prior publications we introduced a number of meta-learning techniques including arbitration, combining [3] and hierarchical tree-structured meta-learning systems. Other publications have reported performance results on standard test problems and data sets with discussions of related techniques, Wolpert's stacking [9], Breiman's bagging [1] and Zhang's combining [10] to name a few. We shall not repeat this exposition in this paper. Here we describe the JAM system architecture designed to support these and perhaps other approaches to distributed data mining.<br />
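To make the idea of integrating base classifiers through a second learning step concrete, the following Java fragment is a minimal sketch and not the JAM implementation. The names `BaseClassifier` and `TableCombiner` are hypothetical, and the lookup-table rule is only one simple way to learn from base predictions; the paper's own arbitration and combining strategies are described in its cited prior publications.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a base classifier learned at one site. */
interface BaseClassifier {
    int classify(double[] x);
}

/** Illustrative combiner: a lookup table from base-prediction patterns to labels. */
class TableCombiner {
    private final List<BaseClassifier> bases;
    // Pattern of base predictions (e.g. "1,0,") -> counts of true labels seen with it.
    private final Map<String, Map<Integer, Integer>> counts = new HashMap<>();

    TableCombiner(List<BaseClassifier> bases) {
        this.bases = new ArrayList<>(bases);
    }

    /** Meta-level training: pair each pattern of base predictions with the true label. */
    void train(double[][] examples, int[] labels) {
        for (int i = 0; i < examples.length; i++) {
            counts.computeIfAbsent(pattern(examples[i]), k -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
    }

    /** Predict the label most often seen with this pattern, else fall back to a vote. */
    int classify(double[] x) {
        Map<Integer, Integer> seen = counts.get(pattern(x));
        return seen != null ? mostFrequent(seen) : mostFrequent(votes(x));
    }

    private String pattern(double[] x) {
        StringBuilder sb = new StringBuilder();
        for (BaseClassifier b : bases) {
            sb.append(b.classify(x)).append(',');
        }
        return sb.toString();
    }

    private Map<Integer, Integer> votes(double[] x) {
        Map<Integer, Integer> tally = new HashMap<>();
        for (BaseClassifier b : bases) {
            tally.merge(b.classify(x), 1, Integer::sum);
        }
        return tally;
    }

    /** Ties are broken toward the smaller label so the result is deterministic. */
    private static int mostFrequent(Map<Integer, Integer> tally) {
        int best = Integer.MAX_VALUE;
        int bestCount = -1;
        for (Map.Entry<Integer, Integer> e : tally.entrySet()) {
            int label = e.getKey();
            int count = e.getValue();
            if (count > bestCount || (count == bestCount && label < best)) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }
}
```

The point of the sketch is that the meta-level learner never sees the raw attributes, only what the base classifiers predicted, which is why base classifiers can be exchanged between sites without exchanging data.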
3 The JAM architecture<br />
JAM is architectured as an agent based system, a distributed computing construct that is designed as an extension of OS environments. It is a distributed meta-learning system that supports the launching of learning and meta-learning agents to distributed database sites. JAM is implemented as a collection of distributed learning and classification programs linked together through a network of Datasites. Each JAM Datasite consists of:<br />
• A local database,<br />
• A learning agent, a machine learning program that may migrate to other sites as a JAVA applet, or be locally stored as a native application callable by a JAVA applet,<br />
• A meta-learning agent,<br />
• A local user configuration file,<br />
• Graphical User Interface and Animation facilities.<br />
The JAM Datasites have been designed to collaborate¹ with each other to exchange classifier agents that are computed by the learning agents.<br />
First, local learning agents operate on the local database and compute the Datasite's local classifiers. Each Datasite may then import (remote) classifiers from its peer Datasites and combine these with its own local classifier using the local meta-learning agent. These actions may take place at all Datasites simultaneously and independently.<br />
The owner of a Datasite administers the local activities via the local user configuration file. Through this file, he/she can specify the required and optional local parameters to perform the learning and meta-learning tasks. Such parameters include the names of the databases to be used, the policy to partition these databases into training and testing subsets, the local learning agents to be dispatched, etc. Besides the static specification of the local parameters (before the beginning of the learning and meta-learning tasks), the owner of the Datasite can also employ JAM's graphical user interface and animation facilities to supervise agent exchanges and administer dynamically the meta-learning process. With this graphical interface, the owner may access more information such as accuracy, trends, statistics and logs and compare and analyze results in order to improve performance.<br />
For example, the owner may study results and decide to repeat the learning process or integrate new knowledge that has become available, or even discard obsolete classifiers. Or, he/she can use JAM's presentation tools to inspect directly the generated decision tree classifiers (computed, for example, by ID3) to gain valuable intuition.<br />
Finally, once the base and meta-classifiers are computed, the JAM system manages the execution of these modules to classify and label datasets of interest.<br />
The configuration of the distributed system is maintained by the Configuration File Manager (CFM), a central and independent module responsible for keeping the state of the system up-to-date. The CFM is a server that provides information about the participating Datasites and logs events for future reference and evaluation.<br />
The logical architecture of the JAM meta-learning system is presented in Figure 1. In this example, three JAM Datasites Marmalade, Mango and Strawberry exchange their base classifiers to share their local view of the learning task. The owner of the Datasite controls the learning task by setting the parameters of the user configuration file, i.e. the algorithms to be used, the images to be used by the animation facility, the folding parameters, etc. In this example, the CFM runs on Cherry and each Datasite ends up with three base classifiers (one local plus the two remote classifiers).<br />
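The sample configuration file shown in Figure 1 is a list of `KEY = value` lines, which happens to match the format read by Java's standard `java.util.Properties`. The sketch below assumes that format and uses the keys visible in the figure; the class name `DatasiteConfig` and the decision to treat every key as required are assumptions made for illustration, not details taken from JAM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the Datasite parameters listed in Figure 1. */
class DatasiteConfig {
    final String cfmHost;
    final String dataset;
    final String learner;
    final String metaLearner;
    final int crossValidationFold;
    final int metaLearningFold;
    final int metaLearningLevel;

    private DatasiteConfig(Properties p) {
        cfmHost = required(p, "CFM");
        dataset = required(p, "DATASET");
        learner = required(p, "LEARNER");
        metaLearner = required(p, "META_LEARNER");
        crossValidationFold = Integer.parseInt(required(p, "CROSS_VALIDATION_FOLD"));
        metaLearningFold = Integer.parseInt(required(p, "META_LEARNING_FOLD"));
        metaLearningLevel = Integer.parseInt(required(p, "META_LEARNING_LEVEL"));
    }

    /** Parses "KEY = value" lines; whitespace around '=' is ignored by Properties. */
    static DatasiteConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DatasiteConfig(p);
    }

    /** Assumption of this sketch: every listed parameter must be present. */
    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing parameter: " + key);
        }
        return value.trim();
    }
}
```

Calling `DatasiteConfig.parse(...)` on the figure's text would yield the CFM host `Cherry.cs.columbia.edu`, the `thyroid` dataset, the `ID3` learner and the `Bayes` meta-learner.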
We have used JAVA technology to build the infrastructure of the system and developed the specific agent operators that compose and spawn new agents from existing classifier agents. JAVA technology provides the means to dispatch agents to remote sites and execute them under remote or local control. The graphical user interface, the animation facilities and most of the machine learning algorithms were also implemented in JAVA. The only parts that were imported in their native (C++) form were some of the machine learning programs; and this was done for faster prototype development and proof of concept. The JAM system builds upon the existing agent infrastructure available over the internet today. The platform-independence of JAVA technology makes it easy to port JAM and delegate its agents to any participating site. (The modules that are implemented in native C++ are not yet platform independent.)<br />
¹ A Datasite may also operate independently without any changes.<br />
Figure 1: The architecture of the meta-learning system (the JAM architecture with 3 datasites). The Configuration File Manager runs on Cherry.cs.columbia.edu. Data Site 1 (Marmalade.cs), Data Site 2 (Mango.cs) and Data Site 3 (Strawberry.cs) each hold a Datasite database and a configuration file; they exchange control and data messages and transfer learning and classifier agents, so that each site ends up holding the classifiers of Marmalade.cs, Mango.cs and Strawberry.cs. The sample configuration file shown in the figure reads:<br />
CFM = Cherry.cs.columbia.edu<br />
DATASET = thyroid<br />
LEARNER = ID3<br />
META_LEARNER = Bayes<br />
CROSS_VALIDATION_FOLD = 2<br />
META_LEARNING_FOLD = 2<br />
META_LEARNING_LEVEL = 1<br />
IMAGE_URL = http://www.cs....<br />
3.1 Configuration File Manager<br />
The CFM assumes a role equivalent to that of a name server of a network system. It is responsible for maintaining the "global" configuration of the system and making it available to the participating Datasites.<br />
The CFM provides registration services to all Datasites that wish to become members and participate in the distributed meta-learning activity. When the CFM receives a JOIN request from a new Datasite, it verifies both the validity of the request and the identity of the Datasite. Upon success, it acknowledges the request and registers the Datasite as active. Similarly, the CFM can receive and verify the DEPARTURE request; it notes the requestor Datasite as inactive and removes it from its list of members. The CFM maintains the list of active member Datasites to establish contact and cooperation between peer Datasites. Apart from that, the CFM keeps information regarding the groups that are formed (which Datasites collaborate with which Datasites), logs the events and displays the status of the system. Through the CFM, the JAM system administrator may screen the Datasites that participate.<br />
3.2 Datasites<br />
Unlike the CFM, which provides a passive configuration maintenance function, the Datasites are the active components of the meta-learning system.<br />
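The passive bookkeeping role that section 3.1 assigns to the CFM (JOIN, DEPARTURE, the list of active members, and an event log) can be sketched in a few lines of Java. This is an in-memory illustration only: the class and method names are hypothetical, there is no networking, and `isValid` is a placeholder for whatever request and identity verification the real CFM performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** In-memory sketch of the CFM's membership bookkeeping (no networking). */
class ConfigurationFileManager {
    private final Set<String> active = new LinkedHashSet<>();
    private final List<String> log = new ArrayList<>();

    /** JOIN: verify the request, then register the Datasite as active. */
    synchronized boolean join(String host) {
        boolean accepted = isValid(host) && active.add(host);
        log.add((accepted ? "JOIN accepted: " : "JOIN rejected: ") + host);
        return accepted;
    }

    /** DEPARTURE: remove the requestor from the list of members. */
    synchronized boolean depart(String host) {
        boolean removed = active.remove(host);
        log.add((removed ? "DEPARTURE accepted: " : "DEPARTURE rejected: ") + host);
        return removed;
    }

    /** The active peers a Datasite may contact, in registration order. */
    synchronized List<String> peersOf(String host) {
        List<String> peers = new ArrayList<>(active);
        peers.remove(host);
        return peers;
    }

    synchronized List<String> eventLog() {
        return new ArrayList<>(log);
    }

    /** Placeholder check; real validity and identity verification is not modeled. */
    private static boolean isValid(String host) {
        return host != null && !host.trim().isEmpty();
    }
}
```

With Marmalade.cs, Mango.cs and Strawberry.cs registered, `peersOf("Marmalade.cs")` returns the other two sites, which is the information Marmalade is described as acquiring from the CFM.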
The Datasites are responsible for running the show. A Datasite manages its local database, builds local classifiers, obtains remote classifiers, builds local meta-classifiers and interacts with a JAM user. A Datasite is implemented as a multithreaded Java program with a special GUI.<br />
Upon initialization, a Datasite starts up the GUI through which it can accept input and display status and results. During its initialization, among its other tasks, the Datasite registers with the CFM, instantiates the local learning engine/agent² and creates a server socket for listening for connections³ from the peer Datasites.<br />
The Datasite is a module driven by input messages or commands. After initialization is complete the Datasite waits for the next event to occur. This can be either<br />
1. A command issued by the owner via the GUI, or<br />
2. A message from a peer Datasite via the open socket.<br />
In both cases, the Datasite verifies that the input is valid and can be serviced. Once this is established, the Datasite allocates a separate thread and performs the required task. This task can be any of JAM's functions: computing a local classifier, starting the meta-learning process, sending local classifiers to peer Datasites or requesting remote classifiers from them, reporting the current status, or presenting computed results.<br />
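The validate-then-spawn behaviour just described can be illustrated as follows. This is a sketch under assumptions: the `Task` names paraphrase the functions listed above, `DatasiteDispatcher` is a hypothetical class, and the "work" done by each thread is reduced to recording that the task ran.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** The JAM functions named in the text, as hypothetical command names. */
enum Task {
    BUILD_LOCAL_CLASSIFIER, START_META_LEARNING, SEND_CLASSIFIERS,
    REQUEST_CLASSIFIERS, REPORT_STATUS, PRESENT_RESULTS
}

/** Sketch of a Datasite servicing GUI commands and peer messages on separate threads. */
class DatasiteDispatcher {
    private final List<String> finished = Collections.synchronizedList(new ArrayList<>());

    /** Returns the worker thread, or null when the input is not a serviceable task. */
    Thread dispatch(String source, String command) {
        final Task task = parse(command);
        if (task == null) {
            return null; // invalid input: nothing is started
        }
        // One thread per accepted request; the task body is a stand-in.
        Thread worker = new Thread(() -> finished.add(source + ":" + task));
        worker.start();
        return worker;
    }

    List<String> finished() {
        synchronized (finished) {
            return new ArrayList<>(finished);
        }
    }

    /** Convenience for callers that want to wait without handling checked exceptions. */
    static void await(Thread worker) {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static Task parse(String command) {
        if (command == null) {
            return null;
        }
        try {
            return Task.valueOf(command);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

A production design would more likely use a bounded thread pool than one raw thread per request; the sketch keeps the one-thread-per-task shape only because that is what the text describes.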
Figure 2 presents a snapshot of the JAM system during the meta-learning phase. In this example three Datasites, Marmalade, Strawberry and Mango (see the group panel of the figure) collaborate in order to share and improve their knowledge in diagnosing hypothyroidism. The snapshot taken is from "Marmalade's point of view". Initially, Marmalade consults the Datasite configuration file where the owner of the Datasite sets the parameters. In this case, the dataset is a medical database with records, noted by thyroid in the Data Set panel. Other parameters include the host of the CFM, the Cross-Validation Fold, the Meta-Learning Fold, the Meta-Learning Level, the names of the local learning agent and the local meta-learning agent, etc. Refer to [2] for more information on the meaning and use of these parameters. (Notice that Marmalade has established that Strawberry and Mango are its peer Datasites, having acquired this information from the CFM.)<br />
Then, Marmalade partitions the thyroid database (noted as thyroid.1.bld and thyroid.2.bld in the Data Set panel) for the 2-Cross-Validation Fold, and afterwards computes the local classifier, noted by Marmalade.1 (here by calling the ID3 learning agent) and viewed at the main panel. Next, Marmalade imports the remote classifiers, noted by Strawberry.1 and Mango.1, and begins the meta-learning process. The snapshot of Figure 2 displays the system at this stage. In the animated meta-learning process JAM's GUI moves icons within the panel displaying the construction of a new meta-classifier. Marmalade will use this meta-classifier in the future to predict the classes of input data items (in this case unlabelled medical records).<br />
The owner of a JAM Datasite has direct control over the stages and the progress of the learning and meta-learning process. He/she can observe the internals of the generated classifiers and meta-classifiers and get reports on the results and statistics.<br />
² The Datasite consults the local Datasite configuration file (maintained by the owner of the Datasite) to obtain information regarding the central CFM and the types of the available machine learning agents.<br />
³ For each connection, the Datasite spawns a separate thread.<br />
Figure 2: Two different snapshots of the JAM system in action. Left: Marmalade is building the meta-classifier (meta-learning stage). Right: An ID3 tree-structured classifier is being displayed in the Classifier Visualization Panel.<br />
3.3 Classifier Visualization<br />
JAM provides graph drawing tools to help users understand the learned knowledge [7]. There are many kinds of classifiers, e.g., a decision tree by ID3, that can be represented as graphs. In JAM we have employed major components of JavaDot [8], an extensible visualization system, to display the classifier and allow the user to analyze the graph. Since each machine learning algorithm has its own format to represent the data classifier, JAM uses an algorithm-specific translator to read the classifier and generate a JavaDot graph representation.<br />
Figure 2 shows the JAM classifier visualization panel with a decision tree, where the leaf nodes represent classes (decisions), the non-leaf nodes represent the attributes under test, and the edges represent the attribute values. The user can select the "Attributes" command from the "Object" pull-down menu to see any additional information about a node or an edge. In the figure, the "Attributes" window shows the classifying information of the highlighted leaf node⁴. It is difficult to view clearly a very large graph (that has a large number of nodes and edges) due to the limited window size. The classifier visualization panel provides commands for the user to traverse and analyze parts of the graph: the user can select a node and use the "Top" command from the "Graph" menu to make the subgraph starting from the selected node be the entire graph in display; use the "Parent" command to view the enclosing graph; and use the "Root" command to see the entire original graph.<br />
Some machine learning algorithms generate concise and very readable textual outputs, e.g., the rule sets from Ripper [6]. It is thus counter-intuitive to translate the text to graph form for display purposes. In such cases, JAM simply pretty formats the text output and displays it in the classifier visualization panel.<br />
⁴ Thus visually, we see that for a test data item, if its "p-2" value is 3 and its "p-14" value is 2, then it belongs to class "0" with .889 probability.<br />
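The graph reading of a decision tree given above (non-leaf nodes test an attribute, edges carry attribute values, leaves carry a class) maps directly onto a small recursive data structure. The sketch below is illustrative and is not JAM's or JavaDot's representation; `TreeNode` and its methods are hypothetical names, and only the single path quoted in footnote 4 is taken from the paper.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal decision-tree node: either an attribute test or a class leaf. */
class TreeNode {
    final String attribute;    // attribute under test; null for a leaf
    final String label;        // class label; null for a non-leaf node
    final double probability;  // class probability stored at a leaf
    private final Map<Integer, TreeNode> children = new HashMap<>(); // edge = attribute value

    private TreeNode(String attribute, String label, double probability) {
        this.attribute = attribute;
        this.label = label;
        this.probability = probability;
    }

    static TreeNode test(String attribute) {
        return new TreeNode(attribute, null, 0.0);
    }

    static TreeNode leaf(String label, double probability) {
        return new TreeNode(null, label, probability);
    }

    /** Attaches a child along the edge for the given attribute value. */
    TreeNode on(int value, TreeNode child) {
        children.put(value, child);
        return this;
    }

    /** Follows edges matching the item's values; returns the leaf reached, or null. */
    TreeNode classify(Map<String, Integer> item) {
        TreeNode node = this;
        while (node != null && node.label == null) {
            node = node.children.get(item.get(node.attribute));
        }
        return node;
    }
}
```

Building `TreeNode.test("p-2").on(3, TreeNode.test("p-14").on(2, TreeNode.leaf("0", 0.889)))` and classifying an item with p-2 = 3 and p-14 = 2 reaches the leaf for class "0" with probability .889, as in footnote 4; an item that matches no edge yields `null` in this sketch.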
3.4 Animation<br />
For demonstration and didactic purposes, the meta-learning component of the JAM graphical user interface contains a collection of animation panels which visually illustrate the stages of meta-learning in parallel with execution. When animation is enabled, a transition into a new stage of computation or analysis triggers the start of the animation sequence corresponding to the underlying activity. The animation loops continuously until the given activity ceases.<br />
The JAM program gives the user the option of manually initiating each distinct meta-learning stage (by clicking a "Next" button), or sending the process into automatic execution (by clicking a "Continue" button). The manual run option provides a temporary program halt. For "hands free" operation of JAM, the user can start the program with animation disabled and execution set to automatic transition to the next stage in the process.<br />
3.5 Agents<br />
JAM's extensible plug-and-play architecture allows snap-in learning agents. The learning and meta-learning agents are designed as objects. JAM provides the definition of the parent agent class, and every instance agent (i.e. a program that implements any of your favorite learning algorithms ID3, CART, BAYES, WPEBLS, etc.) is then defined as a subclass of this parent class. Among other definitions which are inherited by all agent subclasses, the parent agent class provides a very simple and minimal interface that all subclasses have to comply with. As long as a learning or meta-learning agent conforms to this interface, it can be introduced and used immediately in the JAM system even during execution.<br />
To be more specific, a JAM agent needs to have the following methods implemented for JAM to use it effectively:<br />
1. A constructor method with no arguments. JAM can then instantiate the agent, provided it knows its name (which can be supplied by the owner of the Datasite through either the local user configuration file or the GUI).<br />
2. An initialize() method. In most of the cases, if not all, the agent subclasses inherit this method from the parent agent class. Through this method, JAM can supply the necessary arguments to the agent. Arguments include the names of the training and test datasets, the name of the dictionary file, and the filename of the output classifier.<br />
3. A buildClassifier() method. JAM calls this method to trigger the agent to learn (or meta-learn) from the training dataset.<br />
4. The getClassifier() and getCopyOfClassifier() methods. These methods are used by JAM to obtain the newly built classifiers. These are then encapsulated and can be "snapped-in" at any other participating Datasite! Hence, remote agent dispatch is easily accomplished.<br />
The class hierarchy (only methods are shown) for four different learning agents is presented in Figure 3. ID3, Bayes, Wpebls and Ripper inherit the methods initialize() and getClassifier() from their parent learning agent class. The Meta-Learning, Classifier and Meta-Classifier classes are defined in similar hierarchies.<br />
Figure 3: The class hierarchy of learning agents. The parent class Learner declares Learner(), boolean initialize(String dbName, ...), boolean BuildClassifier(), Classifier getCopyOfClassifier() and Classifier getClassifier() { return classifier; }. Its four subclasses ID3Learner (Decision Tree), BayesLearner (Probabilistic), WpeblsLearner (Nearest Neighbor) and RipperLearner (Rule-Based) each declare their own constructor, boolean BuildClassifier() and Classifier getCopyOfClassifier().<br />
JAM is designed and implemented independently of the machine learning programs of interest. As long as a machine learning program is defined and encapsulated as an object conforming to the minimal interface requirements (most existing algorithms have similar interfaces already) it can be imported and used directly. This plug-and-play characteristic makes JAM a truly powerful and extensible data mining facility.<br />
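The four requirements listed for an agent, and the hierarchy of Figure 3, suggest an abstract parent class along the following lines. This is a reconstruction for illustration rather than JAM's source: the `initialize` argument list is shortened to the one parameter visible in the figure, the method is spelled `buildClassifier` as in the text (the figure capitalizes it), `ConstantLearner` is a made-up trivial agent, and loading by class name through reflection is an assumption about how "instantiate the agent, provided it knows its name" could be realised.

```java
/** Hypothetical classifier type: maps a record to a class label. */
interface Classifier {
    String classify(String[] record);
}

/** Parent agent class in the spirit of Figure 3. */
abstract class Learner {
    protected String dbName;
    protected Classifier classifier;

    Learner() { }  // requirement 1: a no-argument constructor

    /** Requirement 2: inherited by subclasses; the real method takes more arguments. */
    boolean initialize(String dbName) {
        this.dbName = dbName;
        return dbName != null && !dbName.isEmpty();
    }

    /** Requirement 3: learn (or meta-learn) from the training data. */
    abstract boolean buildClassifier();

    /** Requirement 4: hand the built classifier back to the caller. */
    Classifier getClassifier() {
        return classifier;
    }

    abstract Classifier getCopyOfClassifier();
}

/** Made-up agent that always predicts one label; stands in for ID3, Ripper, etc. */
class ConstantLearner extends Learner {
    ConstantLearner() { }

    @Override
    boolean buildClassifier() {
        if (dbName == null || dbName.isEmpty()) {
            return false; // not initialized yet
        }
        classifier = record -> "legitimate";
        return true;
    }

    @Override
    Classifier getCopyOfClassifier() {
        return classifier == null ? null : record -> "legitimate";
    }
}

/** Assumed mechanism: create an agent from its class name. */
class AgentLoader {
    static Learner load(String className) {
        try {
            return Class.forName(className)
                        .asSubclass(Learner.class)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load agent " + className, e);
        }
    }
}
```

Because callers depend only on `Learner`, a new algorithm can be added by writing one subclass and naming it in the configuration, which is the plug-and-play property the section describes.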
4 Fraud and Intrusion Detection<br />
A secured and trusted interbanking network for electronic commerce requires high speed verification and authentication mechanisms that allow legitimate users easy access to conduct their business, while thwarting fraudulent transaction attempts by others. Fraudulent electronic transactions are a significant problem, one that will grow in importance as the number of access points in the nation's financial information system grows.<br />
Financial institutions today typically develop custom fraud detection systems targeted to their own asset bases. Recently though, banks have come to realize that a unified, global approach is required, involving the periodic sharing with each other of information about attacks.<br />
We have proposed another wall to protect the nation's financial systems from threats. This new wall of protection consists of pattern-directed inference systems using models of anomalous or errant transaction behaviors to forewarn of impending threats. This approach requires analysis of large and inherently distributed databases of information about transaction behaviors to produce models of "probably fraudulent" transactions. We use JAM to compute these models.<br />
The key difficulties in this approach are: financial companies don't share their data for a number of (competitive and legal) reasons; the databases that companies maintain on transaction behavior are huge and growing rapidly; real-time analysis is highly desirable to update models when new events are detected; and easy distribution of models in a networked environment is essential to maintain up to date detection capability.<br />
JAM is used to compute local fraud detection agents that learn how to detect fraud and provide intrusion detection services within a single corporate information system, and an integrated meta-learning system that combines the collective knowledge acquired by individual local agents. Once derived local classifier agents or models are produced at some Datasite(s), two or more such agents may be composed into a new classifier agent by JAM's meta-learning agents.<br />
JAM allows financial institutions to share their models of fraudulent transactions by exchanging classifier agents in a secured agent infrastructure. But they will not need to disclose their proprietary data. In this way their competitive and legal restrictions can be met, but they can still share information. The meta-learned system can be globally constructed, or alternatively it can be local. In the latter guise, each corporate entity benefits from the collective knowledge by using its privately available data to locally learn a meta-classifier agent from the shared models. The meta-classifiers then act as sentries forewarning of possibly fraudulent transactions and threats by inspecting, classifying and labeling each incoming transaction.<br />
4.1 How Local Detection Components Work<br />
Consider the generic problem of detecting fraudulent transactions, in which we are not concerned with global coordinated attacks. We posit there are two good candidate approaches.<br />
Approach 1:<br />
i) Each bank $i$, 1 [...] their databases, $DB_i$, to produce a classifier $f_i$. In the simplest version, all the $f_i$ have the same input feature space. Note that each $f_i$ is just a mapping from the feature space of transactions, $x$, to a bimodal fraud label.<br />
[...] (This is exactly as in step (iii) of approach 1, except the data set used for combining is $T_i$ (a different one for each bank) rather than $DB$.) Each such mapping is $F_i$.<br />
iv) Each bank uses its $F_i$ as in step (iv) of approach 1.<br />
This latter approach is depicted in figure 4. Note that now there is no issue of where $DB$ comes from, or how it might be formed. However, there is a different issue: would one really expect that $F_i$ is all that much better in predictive accuracy than a classifier simply trained on the entire set of available data, $DB_i \cup T_i$? After all, the $F_i$ are created by looking solely at bank $i$'s local data; the $f_{j \neq i}$ are in essence just new features bank $i$ can use to look at its data. Formal studies are underway to answer this question. Some preliminary results are described next.<br />
Figure 4: Sharing knowledge without sharing data. (Diagram labels: Local Classifier; Remote Classifier 1, Remote Classifier 2, ..., Remote Classifier n; Meta-level Training Data; Local Meta-classifier.)<br />
5 Credit Card Fraud Transaction Data<br />
Here we provide a general view of the data schema for the labelled transaction data sets compiled by a bank and used by our system. For purposes of our research and development activity, several data sets are being acquired from several banks, each providing .5 million records spanning one year, sampling, on average, 42,000 per month, from November 1995 to October 1996.<br />
The schema of the database was developed over years of experience and continuous analysis by bank personnel to capture important information for fraud detection. The general schema of this data is provided in such a way that important confidential and proprietary information is not disclosed here. (After all we seek not to teach "wannabe thieves" important lessons on how to hone their skills.) The records have a fixed length of 137 bytes each and about 30 numeric attributes including the binary classification (fraud/legitimate transaction). Some of the fields are arithmetic and the rest categorical, i.e. numbers were used to represent a few discrete categories. The information in each record includes:<br />
• A (non-revealing) hashed credit card account number<br />
• Scores produced by a commercial authorization/detection system<br />
• The date and time of each transaction<br />
• Past payment information of the transactor<br />
• The amount of the transaction<br />
• Geographic information, that is, information regarding the locations where the transaction was initiated and the location of the merchant and transactor<br />
• Codes for the validity and the manner of entry of the transaction<br />
• An industry standard code for the type of merchant<br />
• A code for other recent "non-monetary" transaction types performed by the transactor<br />
• The age of the account and the card<br />
• Other credit card account information<br />
• Confidential and Proprietary Fields (which potentially carry other indicators)<br />
• The fraud label (the transaction was either fraudulent or legitimate)<br />
It is interesting to note that most of this information is indeed captured by each bank. However, each bank also includes specific fields containing important information, which it has determined separately, that provides predictive value in determining fraudulent transaction patterns. The integration of this information across separately learned classifiers at each bank site is a non-trivial problem, called the data schema integration problem.<br />
We describe two approaches for handling these "proprietary fields" (PF). For purposes of this discussion, we distinguish two separate data sets from two banks called Bank A and Bank B:<br />
Method I: Learn a local model using PF fields and exchange. Let's assume that the fraud detection model learned in Bank A includes the PF fields. Let's also assume that the data at the Bank B site do not include these fields and hence the classifier cannot deal with them. Then, Bank B will not be able to use Bank A's model directly unless:<br />
1. Bank B "massages" its data to include PF values. To do this, Bank B must import, along with the remote classifier, a secure/trusted agent from Bank A that can compute values for the missing PFs in Bank B's data. This however may not be possible in all cases, and may not be desirable for Bank A.<br />
2. Bank B simply includes null values in a "bogus" PF field added to the Bank B data set. Even though the PF fields may have high predictive value for Bank A, they are of no value for Bank B. After all, Bank B did not include them in its authorization system and presumably other attributes (including the common ones) do have predictive value.<br />
Method II: Learn a model using PF fields and hold locally. Again, we assume that the data at Bank A include some additional PF fields that Bank B's data lack. In this approach, Bank A can learn two local models. One, with the PF fields, is stored locally for later use by the meta-learning agents, while the second one, without these fields, is exchanged.<br />
Learning a second classifier without the PF fields, or better yet, with fields that belong to the intersection of the fields of the data sets of the two banks, implies that the second classifier makes use only of the attributes that are common among the participating sites, and no issue exists for its integration at other banks. On the other hand, remote classifiers imported by Bank A (and assured not to involve predictions over the PF fields) can still be locally integrated with the original model that employs the PF fields. In this case, the remote classifiers simply ignore the PF fields of the local data set.<br />
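Restricting the exchanged model to the fields shared by both banks, as Method II suggests, might look like the following sketch. It assumes each record is an attribute-name/value map; `CommonFields`, `intersection`, `project`, and all field names are illustrative assumptions rather than JAM code.<br />

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of Method II: keep only the attributes common to both banks before learning the exchanged model. */
class CommonFields {

    /** Fields present in both schemas, in the order they appear in the first schema. */
    static Set<String> intersection(Set<String> schemaA, Set<String> schemaB) {
        Set<String> common = new LinkedHashSet<>(schemaA);
        common.retainAll(schemaB);
        return common;
    }

    /** Copy of the record restricted to the given fields; PF fields outside the set are dropped. */
    static Map<String, Object> project(Map<String, Object> record, Set<String> keep) {
        Map<String, Object> projected = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            if (keep.contains(entry.getKey())) {
                projected.put(entry.getKey(), entry.getValue());
            }
        }
        return projected;
    }

    public static void main(String[] args) {
        // Hypothetical schemas: Bank A has one proprietary field that Bank B lacks.
        Set<String> bankA = new LinkedHashSet<>(Arrays.asList("amount", "hour", "pfA"));
        Set<String> bankB = new LinkedHashSet<>(Arrays.asList("amount", "hour"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("amount", 99.0);
        record.put("hour", 23);
        record.put("pfA", 7);
        System.out.println(project(record, intersection(bankA, bankB)));
    }
}
```

Bank A would train its locally held model on the full records and the exchanged model on the projected ones.<br />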
Both approaches address the data schema integration problem, and meta-learning over these models should proceed in a straightforward manner.<br />
5.1 Description of the learning process<br />
In this section, we describe the setting of our experiments. In particular, we split the original data set provided by one bank into random partitions and we distributed them across the different sites of the JAM network. Then we computed the accuracy from each model obtained at each such partition.<br />
To be more specific, we sampled 84,000 records from the total of 500,000 records of the data set we used in our experiments, and kept them for the Validation and Test sets to evaluate the accuracy of the resultant distributed models. The learning task is to identify patterns in the 30 attribute fields that can characterize the fraudulent class label.<br />
Let's assume, without loss of generality, that we apply the ID3 learning process to two sites of data (say sites 1 and 2), while two instances of Ripper are applied elsewhere (say at sites 3 and 4), all being initiated as Java agents. The result of these four local computations is four separate classifiers, $C_{\mathrm{ID3}_i}()$, $i = 1, 2$, and $C_{\mathrm{Ripper}_j}()$, $j = 3, 4$, that are each invocable as agents at arbitrary sites of credit card transaction data.<br />
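The random split of one bank's data across JAM sites described at the start of this section can be sketched as follows. The text does not say how the partitions were sized, so this sketch assumes a seeded shuffle followed by round-robin assignment, which yields near-equal partitions; `RandomPartitioner` and `partition` are hypothetical names.<br />

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: split one data set into random, near-equal partitions, one per site. */
class RandomPartitioner {

    /**
     * Shuffles a copy of the records with a seeded generator and deals them
     * round-robin to the sites. Equal-sized partitions are an assumption of this sketch.
     */
    static <T> List<List<T>> partition(List<T> records, int sites, long seed) {
        if (sites <= 0) {
            throw new IllegalArgumentException("sites must be positive");
        }
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> partitions = new ArrayList<>();
        for (int s = 0; s < sites; s++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            partitions.get(i % sites).add(shuffled.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> recordIds = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            recordIds.add(i);
        }
        // Four sites, matching the ID3 (sites 1, 2) and Ripper (sites 3, 4) setting above.
        for (List<Integer> part : partition(recordIds, 4, 42L)) {
            System.out.println(part);
        }
    }
}
```

Each partition would then be handed to the learning agent at its site, and the held-out Validation and Test records kept aside for evaluation.<br />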
A sample Ripper rule-based classifier learned from the credit card data set is depicted in Figure 5, a relatively small set of rules that is easily communicated among distributed sites as needed.5 To extract fraud data from a distinct fifth site of data, or any other site, using, say, $C_{\mathrm{Ripper}_3}()$, the code implementing this classifier would be transmitted to the fifth site and invoked remotely to extract data. This can be accomplished, for example, using a query of the form:<br />
Select X.* From Credit-card-data Where $C_{\mathrm{Ripper}_3}$(X.fraudlabel) = 1.<br />
Naturally, the select expression rendered here in SQL can instead be implemented directly as a data filter applied against incoming transactions at a server site in a fraud detection system.<br />
The end result of this query is a stream of data accessed from some remote source based entirely upon the classifications learned at site 3. Notice that requesting transactions classified as "not fraud" would result in no information being returned at all (rather than [...]<br />
5 The specific confidential attribute names are not revealed here.<br />
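The idea of running an imported classifier as a data filter over incoming transactions, instead of as an SQL select, can be sketched as below. The classifier is modeled as a plain `java.util.function.Predicate`; the class `FraudFilter`, the method `selectFraud`, and the threshold rule are hypothetical stand-ins, since the actual attribute names and rules are confidential.<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Sketch: apply an imported classifier as a filter, mirroring "Where classifier(X) = 1". */
class FraudFilter {

    /** Returns, in arrival order, only the transactions the classifier labels as fraud. */
    static <T> List<T> selectFraud(Iterable<T> transactions, Predicate<? super T> classifier) {
        List<T> flagged = new ArrayList<>();
        for (T transaction : transactions) {
            if (classifier.test(transaction)) {
                flagged.add(transaction);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for an imported classifier: flags amounts of 500 or more.
        Predicate<Double> classifier = amount -> amount >= 500.0;
        List<Double> incoming = Arrays.asList(20.0, 750.0, 499.99, 1200.0);
        System.out.println(selectFraud(incoming, classifier)); // prints [750.0, 1200.0]
    }
}
```

Because the filter runs where the data lives, only the flagged records need to leave the site.<br />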
Figure 5: Left: This sample rule-based model covers 1365 non-fraudulent and 290 fraudulent credit card transactions. Right: Sample portion of the ID3 decision tree meta-classifier learned from the predictions of the four base classifiers. In this portion, only the classifiers [...]<br />
[...] Bayes exhibited 80% TP and 13% FP in one setting and 80% TP and 19 [...] the three base classifiers with the least correlated error and in the second it combined the four most accurate base classifiers. The experiments, settings, rationale and results have been reported in detail in a companion paper also available from http://www.cs.columbia.edu/sal/JAM/PROJECT.<br />
6 Future Research<br />
The JAM prototype is in an ongoing state of development. A number of issues require study. Their resolution will produce a number of enhancements, including:<br />
An intelligent and efficient pairing process. The data sites will be able to discern which classifiers contribute positively and then choose to cooperate more closely with the data sites they came from. The rationale behind this is that the more compatible, representative and diverse a database is, the better the classifiers the data site will produce (assuming an equal quality in the type of learning algorithms).<br />
A mechanism for integrating new knowledge. More and more data will become available, new trends will emerge and new ways to bypass the detection mechanisms will be devised. We will explore ways to incorporate the new knowledge without rendering the present and old data obsolete. In fact, new knowledge can be treated in a fashion similar to the knowledge imported from remote sources. Meta-learning techniques are already in use to combine knowledge over space (remote databases); in the near future they can also be employed to combine knowledge acquired over time. This way, JAM will become a lifelong system constantly accumulating useful knowledge.<br />
Multilevel meta-learning. We will generalize the one-level meta-learning approach to a multilevel one. As more and more classifiers become available we will need to organize them efficiently. This is again a hierarchical approach related to the scalability issue addressed before, but from a different perspective. Data sites will not only exchange base classifiers but entire "meta classifier trees". A meta classifier tree is a tree in which each internal node is a meta classifier and each leaf is a base classifier. (A meta classifier carries with it all its children meta classifiers through which it was built; this happens recursively all the way down to the leaves of the tree.) In fact, one of the key capabilities provided by our proposed system is that it is very easy to import, invoke and handle agents, or even to realize agent composition operators. It is possible, for example, to implement $C_{\mathrm{ID3}_1}()$ (see section 4) as a sub-agent wholly contained within the meta-classifier M. In this case, each sub-agent is defined as a simple Java object callable from within M, and when the meta-classifier agent M is transmitted to a remote site, the sub-agent "travels" with it.<br />
A scalable decision making protocol. As the number of participating Datasites and available databases increases, communication, coordination and efficiency become problematic. We will augment each Datasite of the system with a hierarchical and distributed protocol to facilitate scalability and eliminate bottlenecks.<br />
A distributed configuration file manager. In this first version JAM uses a centralized approach to maintain the global configuration. This is an obvious sequential bottleneck. We will replace the current configuration file manager process with several logically distributed processes that will interact with each other in order to realize scalability, maintain the global configuration, and support fault tolerance.<br />
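The meta classifier tree described under multilevel meta-learning above lends itself to a composite structure, sketched here under stated assumptions. The interface `Classifier` and the class `MetaClassifier` are hypothetical and are not JAM's API, and simple majority voting stands in for a learned combining rule, which JAM would obtain by meta-learning.<br />

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical common interface for base classifiers and meta-classifiers. */
interface Classifier {
    int predict(double[] attributes); // 1 = fraud, 0 = legitimate
}

/**
 * Sketch of an internal node in a meta classifier tree. Children are held as plain
 * Java objects, so whoever ships this object ships its whole subtree with it.
 */
class MetaClassifier implements Classifier {
    private final List<Classifier> children;

    MetaClassifier(Classifier... children) {
        this.children = Arrays.asList(children);
    }

    /** Majority vote over the children; a stand-in for a learned meta-level rule. */
    @Override
    public int predict(double[] attributes) {
        int fraudVotes = 0;
        for (Classifier child : children) {
            fraudVotes += child.predict(attributes); // recursion descends to the leaves
        }
        return 2 * fraudVotes > children.size() ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical base classifiers; the thresholds are illustrative only.
        Classifier baseA = x -> x[0] >= 148 ? 1 : 0;
        Classifier baseB = x -> x[0] >= 695 ? 1 : 0;
        Classifier baseC = x -> x[0] >= 100 ? 1 : 0;
        Classifier tree = new MetaClassifier(new MetaClassifier(baseA, baseB, baseC), baseA, baseB);
        System.out.println(tree.predict(new double[] {200.0}));
    }
}
```

Transmitting such an object to a remote site would additionally require the classes involved to be serializable, which this sketch leaves out.<br />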
7 Conclusions<br />
In this paper we described the JAM architecture, a distributed, scalable, extensible and portable agent-based system that supports the launching of learning and meta-learning agents to distributed database sites. JAM can integrate distributed knowledge and boost the overall predictive accuracy of a number of independently learned classifiers through meta-learning agents. We have engaged JAM in a real, practical and important problem. In collaboration with the FSTC we have populated these database sites with records of credit card transactions, provided by different banks, in an attempt to detect and prevent fraud by combining learned patterns and behaviors from independent sources.<br />
We believe the concepts embodied by the term meta-learning provide an important step in developing systems that learn from massive databases and that scale. A deployed and secured meta-learning system will provide the means of using large numbers of low-cost networked computers that collectively learn, from massive databases, useful and new knowledge that would otherwise be prohibitively expensive to achieve. We believe meta-learning systems deployed as intelligent agents will be an important contributing technology to deploy intrusion detection facilities in global-scale, integrated information systems.<br />
References<br />
[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.<br />
[2] P. Chan. An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. PhD thesis, Department of Computer Science, Columbia University, New York, NY, 1996. (forthcoming).<br />
[3] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Work. Know. Disc. Databases, pages 227-240, 1993.<br />
[4] P. Chan and S. Stolfo. A comparative evaluation of voting and meta-learning on partitioned data. In Proc. Twelfth Intl. Conf. Machine Learning, pages 90-98, 1995.<br />
[5] P. Chan and S. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. Intl. Conf. Knowledge Discovery and Data Mining, pages 39-44, 1995.<br />
[6] William W. Cohen. Fast effective rule induction. In Proc. Twelfth International Conference. Morgan Kaufmann, 1995.<br />
[7] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from data. Communications of the ACM, 39(11):27-34, November 1996.<br />
[8] Wenke Lee and Naser S. Barghouti. Javadot: An extensible visualization environment. Technical Report CUCS-02-97, Department of Computer Science, Columbia University, New York, NY, 1997.<br />
[9] D. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.<br />
[10] X. Zhang, M. Mckenna, J. Mesirov, and D. Waltz. An efficient implementation of the backpropagation algorithm on the connection machine CM-2. Technical Report RL89-1, Thinking Machines Corp., 1989.<br />