
Distributed Data Mining in Credit Card Fraud Detection

Philip K. Chan, Florida Institute of Technology
Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University

Credit card transactions continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years.

Large-scale data-mining techniques can improve on the state of the art in commercial practice. Scalable techniques to analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner are an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, neither of which has been widely studied in the knowledge-discovery and data-mining community.

In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.

This scalable black-box approach for building efficient fraud detectors can significantly reduce loss due to illegitimate behavior. In many cases, the authors' methods outperform a well-known, state-of-the-art commercial fraud-detection system.

Our approach

In today's increasingly electronic society, and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $13 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the cardholders' interest to reduce illegitimate use of credit cards by early fraud detection.
For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce.

The credit card fraud-detection domain presents a number of challenging issues for data mining:

• There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.

• The data are highly skewed—many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.

• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.


Sidebar: The AdaCost algorithm

One of the most important results of our experimental work on this domain was the realization that each of the base-learning algorithms employed in our experiments utilized internal heuristics based on training accuracy, not cost. This led us to investigate new algorithms that employ internal metrics of misclassification cost when computing hypotheses to predict fraud.

Here, we describe AdaCost1 (a variant of AdaBoost2,3), which reduces both fixed and variable misclassification costs more significantly than AdaBoost. We follow the generalized analysis of AdaBoost by Robert Schapire and Yoram Singer.3 Figure A shows the algorithm. Let S = ((x_1, c_1, y_1), ..., (x_m, c_m, y_m)) be a sequence of training examples, where each instance x_i belongs to a domain X, each cost factor c_i belongs to the nonnegative real domain R+, and each label y_i belongs to a finite label space Y. We focus only on binary classification problems in which Y = {-1, +1}. h is a weak hypothesis of the form h: X → R. The sign of h(x) is interpreted as the predicted label, and the magnitude |h(x)| is the "confidence" in this prediction. Let t be an index for the round of boosting and D_t(i) be the weight given to (x_i, c_i, y_i) at the tth round, with 0 ≤ D_t(i) ≤ 1 and Σ_i D_t(i) = 1. α_t is the parameter chosen as the weight for weak hypothesis h_t at the tth round; we assume α_t > 0. β(sign(y_i h_t(x_i)), c_i) is a cost-adjustment function with two arguments: sign(y_i h_t(x_i)), which shows whether h_t(x_i) is correct, and the cost factor c_i.

The difference between AdaCost and AdaBoost is the additional cost-adjustment function β(sign(y_i h_t(x_i)), c_i) in the weight-updating rule.

Figure A. AdaCost:

Given: (x_1, c_1, y_1), ..., (x_m, c_m, y_m); x_i ∈ X, c_i ∈ R+, y_i ∈ {-1, +1}.
Initialize D_1(i) (such as D_1(i) = c_i / Σ_j c_j).
For t = 1, ..., T:
1. Train the weak learner using distribution D_t.
2. Compute weak hypothesis h_t: X → R.
3. Choose α_t ∈ R and β(i) ∈ R+.
4. Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i) β(sign(y_i h_t(x_i)), c_i)) / Z_t, where β(sign(y_i h_t(x_i)), c_i) is the cost-adjustment function and Z_t is a normalization factor chosen so that D_{t+1} will be a distribution.
Output the final hypothesis: H(x) = sign(f(x)), where f(x) = Σ_{t=1}^{T} α_t h_t(x).

Where it is clear in context, we use either β(i) or β(c_i) as a shorthand for β(sign(y_i h_t(x_i)), c_i). Furthermore, we use β+ when sign(y_i h_t(x_i)) = +1 and β- when sign(y_i h_t(x_i)) = -1. For an instance with a higher cost factor, β(i) increases its weight "more" if the instance is misclassified, but decreases its weight "less" otherwise. Therefore, we require β-(c_i) to be nondecreasing with respect to c_i, β+(c_i) to be nonincreasing, and both to be nonnegative. We proved that AdaCost reduces cost on the training data.1

Logically, we can assign a cost factor c of tranamt - overhead to frauds and a factor c of overhead to nonfrauds.
This reflects how the prediction errors will add to the total cost of a hypothesis. Because the actual overhead is a closely guarded trade secret and is unknown to us, we chose to set overhead ∈ {60, 70, 80, 90} to run four sets of experiments. We normalized each c_i to [0, 1]. The cost-adjustment function β is chosen as β-(c) = 0.5c + 0.5 and β+(c) = -0.5c + 0.5.

As in previous experiments, we use training data from one month and data from two months later for testing. Our data set let us form 10 such pairs of training and test sets. We ran both AdaBoost and AdaCost to the 50th round. We used Ripper as the "weak" learner because it provides an easy way to change the training set's distribution. Because using the training set alone usually overestimates a rule set's accuracy, we used the Laplace estimate to generate the confidence for each rule.

We are interested in comparing Ripper (as a baseline), AdaCost, and AdaBoost in several dimensions. First, for each data set and cost model, we determine which algorithm has achieved the lowest cumulative misclassification cost. We wish to know in how many cases AdaCost is the clear winner. Second, we also seek to know, quantitatively, the difference in cumulative misclassification cost between AdaCost and both AdaBoost and the baseline Ripper. It is interesting to measure the significance of these differences in terms of both reduction in misclassification loss and percentage of reduction. Finally, we are interested to know whether AdaCost requires more computing power.

Figure B plots the results from the Chase Bank's credit card data. Figure B1 shows the average reduction over 10 months in percentage cumulative loss (defined as cumulative loss / (maximal loss - least loss) × 100%) for AdaBoost and AdaCost for all 50 rounds with an overhead of $60 (results for other overheads are in other work1). We can clearly see that there is a consistent reduction; the absolute amount of reduction is around 3%. We also observe that the speed of reduction by AdaCost is quicker than that of AdaBoost. The speed is highest in the first few rounds, which means that in practice we might not need to run AdaCost for many rounds. Figure B2 plots the ratio of cumulative cost by AdaCost and AdaBoost. We have plotted the results of all 10 pairs of training and test months over all rounds and overheads. Most of the points are above the y = x line in Figure B2, implying that AdaCost has lower cumulative loss in an overwhelming number of cases.

Figure B. Cumulative cost versus rounds (1); AdaBoost versus AdaCost in cumulative cost (2).

References

1. W. Fan et al., "AdaCost: Misclassification Cost-Sensitive Boosting," Proc. 16th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1999, pp. 97–105.
2. Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
3. R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Proc. 11th Conf. Computational Learning Theory, ACM Press, New York, 1998.
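To make Figure A concrete, the following Python sketch implements the AdaCost loop with the β functions above. It is our illustration, not the authors' implementation: the weak_learner callback (any learner that accepts instance weights and returns a scoring function with values in [-1, +1], as Ripper plays that role in the experiments above) and the Schapire-Singer style choice of α_t are assumptions.

import numpy as np

def beta(correct, c):
    # Cost-adjustment function from the text:
    # beta+(c) = -0.5c + 0.5 on correct predictions, beta-(c) = 0.5c + 0.5 on errors.
    return np.where(correct, -0.5 * c + 0.5, 0.5 * c + 0.5)

def adacost(X, y, c, weak_learner, T=50):
    # y[i] in {-1, +1}; c[i] is the cost factor, normalized to [0, 1].
    D = c / c.sum()                                  # D_1(i) = c_i / sum_j c_j
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                    # weak hypothesis h_t: X -> [-1, +1]
        margins = y * h(X)                           # y_i * h_t(x_i)
        b = beta(margins > 0, c)
        r = float(np.sum(D * margins * b))
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))  # one reasonable alpha_t, following
        D = D * np.exp(-alpha * margins * b)         #   Schapire and Singer's analysis
        D = D / D.sum()                              # divide by Z_t: D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    # Final hypothesis: H(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))

With the cost factors assigned as described (tranamt - overhead for frauds, overhead for nonfrauds, normalized to [0, 1]), a misclassified expensive fraud regains weight fastest, which is exactly the bias toward reducing cost that AdaCost is designed to have.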
Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier.1 Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.

Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture. However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease and eliminates the need for expensive parallel hardware. Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance. This pruning technique increases the learned detectors' computational performance and throughput.

The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class—instances that are in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the "AdaCost algorithm" sidebar.)
Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank's fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques. The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (as well as a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system2 is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.


Credit card data and cost models

Chase Bank and First Union Bank, members of the Financial Services Technology Consortium (FSTC), provided us with real credit card data for this study. The two data sets contain credit card transactions labeled as fraudulent or legitimate. Each bank supplied 500,000 records spanning one year, with a 20% fraud and 80% nonfraud distribution for Chase Bank and 15% versus 85% for First Union Bank. In practice, fraudulent transactions are much less frequent than the 15% to 20% observed in the data given to us. These data might have been cases where the banks have difficulty in determining legitimacy correctly. In some of our experiments, we deliberately create more skewed distributions to evaluate the effectiveness of our techniques under more extreme conditions.

Bank personnel developed the schemata of the databases over years of experience and continuous analysis to capture important information for fraud detection. We cannot reveal the details of the schema beyond what we have described elsewhere.2 The records of one schema have a fixed length of 137 bytes each and about 30 attributes, including the binary class label (fraudulent/legitimate transaction). Some fields are numeric and the rest categorical. Because account identification is not present in the data, we cannot group transactions into accounts. Therefore, instead of learning behavior models of individual customer accounts, we build overall models that try to differentiate legitimate transactions from fraudulent ones. Our models are customer-independent and can serve as a second line of defense, the first being customer-dependent models.

Most machine-learning literature concentrates on model accuracy (either training error or generalization error on hold-out test data, computed as overall accuracy, true-positive or false-positive rates, or return-on-cost analysis). This domain provides a considerably different metric to evaluate the learned models' performance: models are evaluated and rated by a cost model. Due to the different dollar amount of each credit card transaction and other factors, the cost of failing to detect a fraud varies with each transaction. Hence, the cost model for this domain relies on the sum and average of loss caused by fraud. We define

CumulativeCost = Σ_{i=1}^{n} Cost(i)

and

AverageCost = CumulativeCost / n

where Cost(i) is the cost associated with transaction i, and n is the total number of transactions.

After consulting with a bank representative, we jointly settled on a simplified cost model that closely reflects reality. Because it takes time and personnel to investigate a potentially fraudulent transaction, each investigation incurs an overhead. Other related costs—for example, the operational resources needed for the fraud-detection system—are consolidated into overhead. So, if the amount of a transaction is smaller than the overhead, investigating the transaction is not worthwhile even if it is suspicious. For example, if it takes $10 to investigate a potential loss of $1, it is more economical not to investigate it. Therefore, assuming a fixed overhead, we devised the cost model shown in Table 1 for each transaction, where tranamt is the amount of a credit card transaction.

Table 1. Cost model assuming a fixed overhead.

OUTCOME                           COST
Miss (false negative—FN)          tranamt
False alarm (false positive—FP)   overhead if tranamt > overhead; 0 if tranamt ≤ overhead
Hit (true positive—TP)            overhead if tranamt > overhead; tranamt if tranamt ≤ overhead
Normal (true negative—TN)         0
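Table 1 translates directly into a per-transaction cost function. The sketch below is our reading of the table, not the banks' production logic; the flag and parameter names are illustrative.

def transaction_cost(predicted_fraud, actual_fraud, tranamt, overhead):
    # Per-transaction cost under Table 1, with a fixed overhead per investigation.
    if actual_fraud and not predicted_fraud:
        return tranamt                # miss (FN): the full amount is lost
    if predicted_fraud and not actual_fraud:
        # false alarm (FP): only transactions larger than the overhead are challenged
        return overhead if tranamt > overhead else 0
    if predicted_fraud and actual_fraud:
        # hit (TP): pay the overhead; frauds at or below it go unchallenged and are lost
        return overhead if tranamt > overhead else tranamt
    return 0                          # normal (TN): no cost

def cumulative_and_average_cost(outcomes, overhead):
    # outcomes: iterable of (predicted_fraud, actual_fraud, tranamt) triples
    costs = [transaction_cost(p, a, amt, overhead) for p, a, amt in outcomes]
    return sum(costs), sum(costs) / len(costs)    # CumulativeCost, AverageCost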
The overhead threshold, for obvious reasons, is a closely guarded secret and varies over time. The range of values used here is probably a reasonable bound for this data set, although the actual overheads are probably significantly lower. We evaluated all our empirical studies using this cost model.

Skewed distributions

Given a skewed distribution, we would like to generate a training set of labeled transactions with a desired distribution, without removing any data, which maximizes classifier performance. In this domain, we found that determining the desired distribution is an experimental art: it requires extensive empirical tests to find the most effective training distribution.

In our approach, we first create data subsets with the desired distribution (determined by extensive sampling experiments). Then we generate classifiers from these subsets and combine them by metalearning from their classification behavior. For example, if the given skewed distribution is 20:80 and the desired distribution for generating the best models is 50:50, we randomly divide the majority instances into four partitions and form four data subsets by merging the minority instances with each of the four partitions containing majority instances. That is, the minority instances replicate across the four data subsets to generate the desired 50:50 distribution for each distributed training set.

For concreteness, let N be the size of the data set with a distribution of x:y (x is the percentage of the minority class), and let u:v be the desired distribution. The number of minority instances is N × x, and the desired number of majority instances in a subset is Nx × v/u. The number of subsets is the number of majority instances (N × y) divided by the number of desired majority instances in each subset, which is Ny divided by Nxv/u, or y/x × u/v. So, we have y/x × u/v subsets, each of which has Nx minority instances and Nxv/u majority instances. The next step is to apply a learning algorithm or algorithms to each subset.
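The partitioning step itself is short; the following Python sketch (ours, assuming a simple list-of-records representation, not code from the JAM system) carries it out:

import random

def desired_distribution_subsets(minority, majority, u, v, seed=0):
    # Each subset holds all Nx minority instances plus Nx * v/u majority
    # instances, giving y/x * u/v subsets with the desired u:v distribution.
    # (The last subset can come up slightly short when the division is inexact.)
    per_subset = max(1, round(len(minority) * v / u))
    rest = list(majority)
    random.Random(seed).shuffle(rest)
    return [list(minority) + rest[i:i + per_subset]
            for i in range(0, len(rest), per_subset)]

With 100,000 minority and 400,000 majority instances and u:v = 1:1, this yields the four 50:50 subsets of the example above.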
Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve speed for superlinear-time learning algorithms. The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere.1

To simplify our discussion, we describe only the class-combiner (or stacking) strategy.3 This strategy composes a metalevel training set by using the base classifiers' predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than the voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers, and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting the cost performance, making it relatively efficient in the credit card authorization process.
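A compact sketch of the class-combiner strategy follows. The original implementation lives in the Java-based JAM system; scikit-learn is our modern stand-in, with naive Bayes playing the role of the Bayes metalearner used in the experiments below, and the base classifiers assumed to be already trained on their subsets.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_class_combiner(base_classifiers, X_val, y_val):
    # Metalevel training set: the base classifiers' predictions on the
    # validation set are the attribute values; the true labels are the class.
    meta_attrs = np.column_stack([clf.predict(X_val) for clf in base_classifiers])
    metaclassifier = GaussianNB().fit(meta_attrs, y_val)

    def predict(X):
        # During online detection the base classifiers could run in
        # parallel on different processors; here we query them in turn.
        preds = np.column_stack([clf.predict(X) for clf in base_classifiers])
        return metaclassifier.predict(preds)

    return predict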


Experiments and results. To evaluate our multiclassifier class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase.4 We used transactions from the first eight months (10/95–5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Because credit card transactions have a natural two-month business cycle—the time to bill a customer is one month, followed by a one-month payment period—the true label of a transaction cannot be determined in less than two months' time. Hence, models built from data in one month cannot be rationally applied for fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results from the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month, for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here.

Furthermore, to investigate whether our approach is indeed fruitful, we ran experiments on the class-combiner strategy directly applied to the original data sets from the first eight months (that is, they have the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining.

Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers: 8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers).

Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval).

                                       OVERHEAD = $50             OVERHEAD = $75             OVERHEAD = $100
METHOD (TRAINING FRAUD %)              COST          SAVED (%/$K)  COST          SAVED (%/$K)  COST          SAVED (%/$K)
Class combiner (50%)                   17.96 ± .14   51 / 761      20.07 ± .13   46 / 676      21.87 ± .12   41 / 604
Single CART (50%)                      20.81 ± .75   44 / 647      23.64 ± .96   36 / 534      26.05 ± 1.25  30 / 437
Class combiner (given)                 22.61         39 / 575      23.99         35 / 519      25.20         32 / 471
Average single classifier (given)      27.97 ± 1.64  24 / 360      29.08 ± 1.60  21 / 315      30.02 ± 1.56  19 / 278
COTS (N/A)                             25.44         31 / 461      26.40         29 / 423      27.24         26 / 389
(We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data set because this approach increases the training-set size and is not appropriate in domains with large amounts of data—one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings—at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented.

Surprisingly, we also observe that when the overhead is $50, a classifier (single CART) trained from one month's data with the desired 50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combining also contributed to the performance improvement. Consequently, utilizing the desired training distribution together with the class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions.

Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly less savings than our techniques for the three overhead amounts we report in this table. This comparison might not be entirely accurate, because COTS has much more training data than we have, and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud—the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world.

We also evaluated our method with more skewed distributions (obtained by downsampling minority instances): 10:90, 1:99, and 1:999. As we discussed earlier, the desired distribution is not necessarily 50:50—for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS.
Neither method achieved any savings with 1:999 distributions.

To characterize the conditions under which our techniques are effective, we calculate R, the ratio of the overhead amount to the average cost: R = Overhead / AverageCost. Our approach is significantly more effective than the deployed COTS when R < 6. Neither method is effective when R > 24. So, under a reasonable cost model with a fixed overhead cost for challenging transactions as potentially fraudulent, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud; the loss due to this fraud is yet another cost of conducting business. However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors that are built from individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions carry close to no risk; hence, purchases with similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlined business operations and increased automation will also lower the ratio.

Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible.

In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences between the schemata of databases yield incompatible classifiers—that is, a classifier cannot be applied to data of a different format. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate in their systems useful information that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites A and B with databases DB_A and DB_B, having similar but not identical schemata. Without loss of generality, we assume that

Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, B_{n+1}, C}

where A_i and B_i denote the ith attribute of DB_A and DB_B, respectively, and C denotes the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that A_i = B_i, 1 ≤ i ≤ n. As for the A_{n+1} and B_{n+1} attributes, there are two possibilities:

1. A_{n+1} ≠ B_{n+1}: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, C}

where we assume that attribute B_{n+1} is not present in DB_B. (The dual problem has DB_B composed with B_{n+1}, but A_{n+1} is not available to A.)
2. A_{n+1} ≈ B_{n+1}: The two attributes are of similar type but slightly different semantics—that is, there might be a map from the domain of one type to the domain of the other. For example, A_{n+1} and B_{n+1} are fields with time-dependent information but of different duration (that is, A_{n+1} might denote the number of times an event occurred within a window of half an hour, and B_{n+1} might denote the number of times the same event occurred but within 10 minutes).

In both cases (attribute A_{n+1} is either not present in DB_B or semantically different from the corresponding B_{n+1}), the classifiers C_Aj derived from DB_A are not compatible with DB_B's data and therefore cannot be directly used at DB_B's site, and vice versa. But the purpose of using a distributed data-mining system and deploying learning agents and metalearning their classifier agents is to be able to combine information from different sources.

Bridging methods. There are two methods for handling the missing attributes.5

Method I: Learn a local model for the missing attribute and exchange. Database DB_B imports, along with the remote classifier agent, a bridging agent from database DB_A that is trained to compute values for the missing attribute A_{n+1} in DB_B's data. Next, DB_B applies the bridging agent on its own data to estimate the missing values. In many cases, however, this might not be possible or desirable for DB_A (for example, in case the attribute is proprietary). The alternative for database DB_B is to simply add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for DB_A, it is of no value to DB_B. After all, DB_B did not include it in its schema, and presumably other attributes (including the common ones) have predictive value.

Method II: Learn a local model without the missing attribute and exchange. In this approach, database DB_A can learn two local models: one with the attribute A_{n+1}, which can be used locally by the metalearning agents, and one without it, which can subsequently be exchanged. Learning a second classifier agent without the missing attribute, or with the attributes that belong to the intersection of the attributes of the two databases' data sets, implies that the second classifier uses only the attributes that are common among the participating sites, so no issue exists for its integration at other databases. But remote classifiers imported by database DB_A (and assured not to involve predictions over the missing attributes) can still be locally integrated with the original model that employs A_{n+1}. In this case, the remote classifiers simply ignore the local data set's missing attributes.
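Here is a minimal sketch of Method I, under the assumption that A_{n+1} is continuous, so a regressor can learn it from the n common attributes; the scikit-learn estimators are our stand-ins for JAM's learning and bridging agents, and the depth bound is arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_bridging_agent(X_common_A, extra_attr_A):
    # At site A: learn to predict the extra attribute A_{n+1} from the
    # n attributes both databases share.
    return DecisionTreeRegressor(max_depth=5).fit(X_common_A, extra_attr_A)

def apply_remote_classifier(classifier_A, bridging_agent, X_common_B):
    # At site B: estimate the missing attribute, append it as an extra
    # column, and only then apply the classifier imported from site A.
    estimated = bridging_agent.predict(X_common_B).reshape(-1, 1)
    return classifier_A.predict(np.hstack([X_common_B, estimated]))

For a categorical A_{n+1}, a classifier would replace the regressor. Method II needs no bridging code at all, because the exchanged model is trained only on the common attributes.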
Both approaches address the incompatible schema problem, and metalearning over these models should proceed in a straightforward manner. Compared to Zbigniew Ras's related work,6 our approach is more general because our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.

Pruning

An ensemble of classifiers can be unnecessarily complex: many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to utilize heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and at the same time achieve comparable or better predictive performance than fully grown (unpruned) metaclassifiers.

To this end, we introduced two stages for pruning metaclassifiers: the pretraining and posttraining pruning stages. Both stages are essential and complementary to each other with respect to the improvement of the system's accuracy and efficiency.

Pretraining pruning refers to the filtering of the classifiers before they are combined. Instead of combining classifiers in a brute-force manner, we introduce a preliminary stage for analyzing the available classifiers and qualifying them for inclusion in a combined metaclassifier. Only those classifiers that appear (according to one or more predefined metrics) to be most promising participate in the final metaclassifier. Here, we adopt a black-box approach that evaluates the set of classifiers based only on their input and output behavior, not their internal structure. Conversely, posttraining pruning denotes the evaluation and pruning of constituent base classifiers after a complete metaclassifier has been constructed. We have implemented and experimented with three pretraining and two posttraining pruning algorithms, each with different search heuristics.

The first pretraining pruning algorithm ranks and selects its classifiers by evaluating each candidate classifier independently (metric-based), and the second algorithm decides by examining the classifiers in correlation with each other (diversity-based). The third relies on the independent performance of the classifiers and the manner in which they predict with respect to each other and with respect to the underlying data set (coverage- and specialty-based). The first posttraining pruning algorithm is based on a cost-complexity pruning technique (a technique the CART decision-tree learning algorithm uses that seeks to minimize the cost and size of its tree while reducing the misclassification rate). The second is based on the correlation between the classifiers and the metaclassifier.7

Compared to Dragos Margineantu and Thomas Dietterich's approach,8 ours considers the more general setting where ensembles of classifiers can be obtained by applying possibly different learning algorithms over (possibly) distinct databases. Furthermore, instead of voting (as AdaBoost does) over the predictions of classifiers for the final classification, we use metalearning to combine the individual classifiers' predictions.
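As one concrete instance, the metric-based pretraining pruning algorithm reduces to ranking and truncation. This sketch is ours; evaluate is an assumed callback that scores a classifier on validation data (for example, by the cost-model savings defined earlier, via a hypothetical savings function), and the diversity-, coverage-, and cost-complexity-based variants would replace this ranking heuristic with their own.

def pretraining_prune(candidates, evaluate, k):
    # Metric-based pretraining pruning: score each candidate base classifier
    # independently on validation data, then keep only the k most promising
    # ones for the metalearning stage.
    return sorted(candidates, key=evaluate, reverse=True)[:k]

# Hypothetical usage, scoring by cost-model savings on a validation month:
# pruned = pretraining_prune(base_classifiers,
#                            lambda clf: savings(clf, X_val, y_val, amounts),
#                            k=32)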
Evaluation of knowledge sharing and pruning. First, we distributed the data sets across six different data sites (each site stores two months of data) and prepared the set of candidate base classifiers—that is, the original set of base classifiers the pruning algorithm is called to evaluate. We computed these classifiers by applying five learning algorithms (Bayes, C4.5, ID3, CART, and Ripper) to each month of data, creating 60 base classifiers (10 classifiers per data site). Next, we had each data site import the remote base classifiers (50 in total) that we subsequently used in the pruning and metalearning phases, thus ensuring that each classifier would not be tested unfairly on known data. Specifically, we had each site use half of its local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned and unpruned metaclassifier's overall performance.7 In essence, this experiment's setting corresponds to a parallel sixfold cross-validation.

Finally, we had the two simulated banks exchange their classifier agents. In addition to its 10 local and 50 internal classifiers (those imported from its peer data sites), each site also imported 60 external classifiers (from the other bank). Thus, we populated each simulated Chase data site with 60 (10 + 50) Chase classifiers and 60 First Union classifiers, and we populated each First Union site with 60 (10 + 50) First Union classifiers and 60 Chase classifiers. Again, the sites used half of their local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned or unpruned metaclassifier's overall performance. To ensure fairness, we did not use the 10 local classifiers in metalearning.

The two databases, however, had the following schema differences:

• Chase and First Union defined a (nearly identical) feature with different semantics.
• Chase includes two (continuous) features not present in the First Union data.

Table 3. Results on knowledge sharing and pruning.

                                          CHASE                 FIRST UNION
CLASSIFICATION METHOD                     SIZE  SAVINGS ($K)    SIZE  SAVINGS ($K)
COTS scoring system from Chase            N/A   682             N/A   N/A
Best base classifier over entire set      1     762             1     803
Best base classifier over one subset      1     736             1     770
Metaclassifier                            50    818             50    944
Metaclassifier                            32    870             21    943
Metaclassifier (+ First Union)            110   800             N/A   N/A
Metaclassifier (+ First Union)            63    877             N/A   N/A
Metaclassifier (+ Chase – bridging)       N/A   N/A             110   942
Metaclassifier (+ Chase + bridging)       N/A   N/A             110   963
Metaclassifier (+ Chase + bridging)       N/A   N/A             56    962

For the first incompatibility, we mapped the First Union data values to the Chase data's semantics. For the second incompatibility, we deployed bridging agents to compute the missing values.5
When predicting, the First Union classifiers simply disregarded the real values provided at the Chase data sites, while the Chase classifiers relied on both the common attributes and the predictions of the bridging agents to deliver a prediction at the First Union data sites.

Table 3 summarizes our results for the Chase and First Union banks, displaying the savings for each fraud predictor examined. The column denoted as size indicates the number of base classifiers used in the ensemble classification system. The first row of Table 3 shows the best possible performance of Chase's own COTS authorization and detection system on this data set. The next two rows present the performance of the best base classifiers over the entire set and over a single month's data, while the last four rows detail the performance of the unpruned (sizes of 50 and 110) and pruned metaclassifiers (sizes of 32 and 63). The first two of these metaclassifiers combine only internal (from Chase) base classifiers, while the last two combine both internal and external (from Chase and First Union) base classifiers. We did not use bridging agents in these experiments, because Chase defined all attributes used by the First Union classifier agents.

Table 3 records similar data for the First Union data set, with the exception of First Union's COTS authorization and detection performance (it was not made available to us), and with the additional results obtained when employing special bridging agents from Chase to compute the values of First Union's missing attributes. Most obviously, these experiments show the superior performance of metalearning over the single-model approaches and over the traditional authorization and detection systems (at least for the given data sets). The metaclassifiers outperformed the single base classifiers in every category. Moreover, by bridging the two databases, we managed to further improve the metalearning system's performance.

However, combining classifier agents from the two banks directly (without bridging) is not very effective, no doubt because the attribute missing from the First Union data set is significant in modeling the Chase data set. Hence, the First Union classifiers are not as effective as the Chase classifiers on the Chase data, and the Chase classifiers cannot perform at full strength at the First Union sites without the bridging agents.

Table 3 also shows the invaluable contribution of pruning. In all cases, pruning succeeded in computing metaclassifiers with similar or better fraud-detection capabilities, while reducing their size and thus improving their efficiency. We have provided a detailed description of the pruning methods and a comparative study of predictive performance and metaclassifier throughput elsewhere.7

One limitation of our approach to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution depends strongly on the cost model and the learning algorithm. Currently, for simplicity, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge.

Our experience in credit card fraud detection has also affected other important applications. Encouraged by our results, we shifted our attention to the growing problem of intrusion detection in network- and host-based computer systems. Here we seek to perform the same sort of task as in the credit card fraud domain: we seek to build models to distinguish between bad (intrusions or attacks) and good (normal) connections or processes. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions.9
9 Thiswork, as well as the results reported <strong>in</strong> thisarticle, demonstrates conv<strong>in</strong>c<strong>in</strong>gly that distributeddata-m<strong>in</strong><strong>in</strong>g techniques that comb<strong>in</strong>emultiple models produce effective fraud and<strong>in</strong>trusion detectors.AcknowledgmentsWe thank the past and current participants ofthis project: Charles Elkan, Wenke Lee, ShelleyTselepis, Alex Tuzhil<strong>in</strong>, and Junx<strong>in</strong> Zhang. Wealso appreciate the data and expertise provided byChase and First Union for conduct<strong>in</strong>g this study.This research was partially supported by grantsfrom DARPA (F30602-96-1-0311) and the NSF(IRI-96-32225 and CDA-96-25374).References1. P. Chan and S. Stolfo, “Metalearn<strong>in</strong>g for Multistrategyand Parallel Learn<strong>in</strong>g,” Proc. SecondInt’l Workshop Multistrategy Learn<strong>in</strong>g,Center for Artificial Intelligence, GeorgeMason Univ., Fairfax, Va., 1993, pp. 150–165.2. S. Stolfo et al., “JAM: Java Agents for Metalearn<strong>in</strong>gover <strong>Distributed</strong> <strong>Data</strong>bases,” Proc.Third Int’l Conf. Knowledge Discovery and<strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>, AAAI Press, Menlo Park,Calif., 1997, pp. 74–81.3. D. Wolpert, “Stacked Generalization,” NeuralNetworks, Vol. 5, 1992, pp. 241–259.4. P. Chan and S. Stolfo, “Toward ScalableLearn<strong>in</strong>g with Nonuniform Class and CostDistributions: A Case Study <strong>in</strong> <strong>Credit</strong> <strong>Card</strong><strong>Fraud</strong> <strong>Detection</strong>,” Proc. Fourth Int’l Conf.Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1998, pp.164–168.5. A. Prodromidis and S. Stolfo, “<strong>M<strong>in</strong><strong>in</strong>g</strong> <strong>Data</strong>baseswith Different Schemas: Integrat<strong>in</strong>gIncompatible Classifiers,” Proc. Fourth IntlConf. Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1998,pp. 314–318.6. Z. Ras, “Answer<strong>in</strong>g Nonstandard Queries <strong>in</strong><strong>Distributed</strong> Knowledge-Based Systems,”Rough Sets <strong>in</strong> Knowledge Discovery, Studies<strong>in</strong> Fuzz<strong>in</strong>ess and Soft Comput<strong>in</strong>g, L. Polkowskiand A. Skowron, eds., Physica Verlag,New York, 1998, pp. 98–108.7. A. Prodromidis and S. Stolfo, “Prun<strong>in</strong>g Metaclassifiers<strong>in</strong> a <strong>Distributed</strong> <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong> System,”Proc. First Nat’l Conf. New InformationTechnologies, Editions of New Tech.Athens, 1998, pp. 151–160.8. D. Marg<strong>in</strong>eantu and T. Dietterich, “Prun<strong>in</strong>gAdaptive Boost<strong>in</strong>g,” Proc. 14th Int’l Conf.Mach<strong>in</strong>e Learn<strong>in</strong>g, Morgan Kaufmann, SanFrancisco, 1997, pp. 211–218.9. W. Lee, S. Stolfo, and K. Mok, “<strong>M<strong>in</strong><strong>in</strong>g</strong> <strong>in</strong> a<strong>Data</strong>-Flow Environment: Experience <strong>in</strong> NetworkIntrusion <strong>Detection</strong>,” Proc. Fifth Int’lConf. Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1999,pp. 114–124.Philip K. 
Chan is an assistant professor of computerscience at the Florida Institute of Technology.His research <strong>in</strong>terests <strong>in</strong>clude scalable adaptivemethods, mach<strong>in</strong>e learn<strong>in</strong>g, data m<strong>in</strong><strong>in</strong>g,distributed and parallel comput<strong>in</strong>g, and <strong>in</strong>telligentsystems. He received his PhD, MS, and BS <strong>in</strong> computerscience from Columbia University, VanderbiltUniversity, and Southwest Texas State University,respectively. Contact him at ComputerScience, Florida Tech, Melbourne, FL 32901;pkc@cs.fit. edu; www.cs.fit.edu/~pkc.Wei Fan is a PhD candidate <strong>in</strong> computer science atColumbia University. His research <strong>in</strong>terests are <strong>in</strong>mach<strong>in</strong>e learn<strong>in</strong>g, data m<strong>in</strong><strong>in</strong>g, distributed systems,<strong>in</strong>formation retrieval, and collaborative filter<strong>in</strong>g. Hereceived his MSc and B.Eng from Ts<strong>in</strong>ghua Universityand MPhil from Columbia University. Contacthim at the Computer Science Dept., ColumbiaUniv., New York, NY 10027; wfan@cs.columbia.edu;www.cs.columbia.edu/~wfan.Andreas Prodromidis is a director <strong>in</strong> research anddevelopment at iPrivacy. His research <strong>in</strong>terests<strong>in</strong>clude data m<strong>in</strong><strong>in</strong>g, mach<strong>in</strong>e learn<strong>in</strong>g, electroniccommerce, and distributed systems. He received aPhD <strong>in</strong> computer science from Columbia Universityand a Diploma <strong>in</strong> electrical eng<strong>in</strong>eer<strong>in</strong>g fromthe National Technical University of Athens. He isa member of the IEEE and AAAI. Contact him atiPrivacy, 599 Lex<strong>in</strong>gton Ave., #2300, New York,NY 10022; andreas@iprivacy.com; www.cs.columbia. edu/~andreas.Salvatore J. Stolfo is a professor of computer scienceat Columbia University and codirector of theUSC/ISI and Columbia Center for AppliedResearch <strong>in</strong> Digital Government Information Systems.His most recent research has focused on distributeddata-m<strong>in</strong><strong>in</strong>g systems with applications tofraud and <strong>in</strong>trusion detection <strong>in</strong> network <strong>in</strong>formationsystems. He received his PhD from NYU’sCourant Institute. Contact him at the ComputerScience Dept., Columbia Univ., New York, NY10027; sal@cs.columbia.edu; www.cs.columbia.edu/sal; www.cs.columbia.edu/~sal.74 IEEE INTELLIGENT SYSTEMS
