
Distributed Data Mining in Credit Card Fraud Detection

Philip K. Chan, Florida Institute of Technology
Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University

Credit card transactions continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years.

Large-scale data-mining techniques can improve on the state of the art in commercial practice. Scalable techniques to analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner are an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, neither of which has been widely studied in the knowledge-discovery and data-mining community.

In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.

This scalable black-box approach for building efficient fraud detectors can significantly reduce loss due to illegitimate behavior. In many cases, the authors' methods outperform a well-known, state-of-the-art commercial fraud-detection system.

Our approach

In today's increasingly electronic society, and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $13 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the cardholders' interest to reduce illegitimate use of credit cards by early fraud detection.
For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce.

The credit card fraud-detection domain presents a number of challenging issues for data mining:

• There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.

• The data are highly skewed—many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.

• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.


Sidebar: The AdaCost algorithm

One of the most important results of our experimental work on this domain was the realization that each of the base-learning algorithms employed in our experiments utilized internal heuristics based on training accuracy, not cost. This led us to investigate new algorithms that employ internal metrics of misclassification cost when computing hypotheses to predict fraud.

Here, we describe AdaCost1 (a variant of AdaBoost2,3), which reduces both fixed and variable misclassification costs more significantly than AdaBoost. We follow the generalized analysis of AdaBoost by Robert Schapire and Yoram Singer.3 Figure A shows the algorithm. Let S = ((x_1, c_1, y_1), ..., (x_m, c_m, y_m)) be a sequence of training examples, where each instance x_i belongs to a domain X, each cost factor c_i belongs to the nonnegative real domain R+, and each label y_i belongs to a finite label space Y. We focus only on binary classification problems in which Y = {-1, +1}. h is a weak hypothesis of the form h: X → R. The sign of h(x) is interpreted as the predicted label, and the magnitude |h(x)| is the "confidence" in this prediction. Let t be an index for the round of boosting and D_t(i) be the weight given to (x_i, c_i, y_i) at the tth round, with 0 ≤ D_t(i) ≤ 1 and Σ_i D_t(i) = 1. α_t is the parameter chosen as the weight for weak hypothesis h_t at the tth round; we assume α_t > 0. β(sign(y_i h_t(x_i)), c_i) is a cost-adjustment function with two arguments: sign(y_i h_t(x_i)), which shows whether h_t(x_i) is correct, and the cost factor c_i.

The difference between AdaCost and AdaBoost is the additional cost-adjustment function β(sign(y_i h_t(x_i)), c_i) in the weight-updating rule.

Figure A. AdaCost:

Given: (x_1, c_1, y_1), ..., (x_m, c_m, y_m); x_i ∈ X, c_i ∈ R+, y_i ∈ {-1, +1}.
Initialize D_1(i) (such as D_1(i) = c_i / Σ_j c_j).
For t = 1, ..., T:
1. Train the weak learner using distribution D_t.
2. Compute weak hypothesis h_t: X → R.
3. Choose α_t ∈ R and β(i) ∈ R+.
4. Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i) β(sign(y_i h_t(x_i)), c_i)) / Z_t, where β(sign(y_i h_t(x_i)), c_i) is the cost-adjustment function and Z_t is a normalization factor chosen so that D_{t+1} will be a distribution.
Output the final hypothesis: H(x) = sign(f(x)), where f(x) = Σ_{t=1}^{T} α_t h_t(x).

Where it is clear in context, we use either β(i) or β(c_i) as a shorthand for β(sign(y_i h_t(x_i)), c_i). Furthermore, we use β+ when sign(y_i h_t(x_i)) = +1 and β- when sign(y_i h_t(x_i)) = -1. For an instance with a higher cost factor, β(i) increases its weight "more" if the instance is misclassified, but decreases its weight "less" otherwise. Therefore, we require β-(c_i) to be nondecreasing with respect to c_i, β+(c_i) to be nonincreasing, and both to be nonnegative. We proved that AdaCost reduces cost on the training data.1

Logically, we can assign a cost factor c of tranamt - overhead to frauds and a factor c of overhead to nonfrauds.
This reflects how the prediction errors will add to the total cost of a hypothesis. Because the actual overhead is a closely guarded trade secret and is unknown to us, we chose to set overhead ∈ {60, 70, 80, 90} to run four sets of experiments. We normalized each c_i to [0, 1]. The cost-adjustment function β is chosen as β-(c) = 0.5c + 0.5 and β+(c) = -0.5c + 0.5.

As in previous experiments, we use training data from one month and data from two months later for testing. Our data set let us form 10 such pairs of training and test sets. We ran both AdaBoost and AdaCost to the 50th round. We used Ripper as the "weak" learner because it provides an easy way to change the training set's distribution. Because using the training set alone usually overestimates a rule set's accuracy, we used the Laplace estimate to generate the confidence for each rule.

We are interested in comparing Ripper (as a baseline), AdaCost, and AdaBoost in several dimensions. First, for each data set and cost model, we determine which algorithm has achieved the lowest cumulative misclassification cost. We wish to know in how many cases AdaCost is the clear winner. Second, we also seek to know, quantitatively, the difference in cumulative misclassification cost between AdaCost and both AdaBoost and the baseline Ripper. It is interesting to measure the significance of these differences in terms of both reduction in misclassification loss and percentage of reduction. Finally, we are interested to know whether AdaCost requires more computing power.

Figure B plots the results from the Chase Bank's credit card data. Figure B1 shows the average reduction over 10 months in percentage cumulative loss (defined as cumulative loss / (maximal loss - least loss) × 100%) for AdaBoost and AdaCost for all 50 rounds with an overhead of $60 (results for other overheads are in other work1). We can clearly see that there is a consistent reduction; the absolute amount of reduction is around 3%. We also observe that the speed of reduction by AdaCost is quicker than that of AdaBoost. The speed is highest in the first few rounds, which means that in practice we might not need to run AdaCost for many rounds. Figure B2 plots the ratio of cumulative cost by AdaCost and AdaBoost. We have plotted the results of all 10 pairs of training and test months over all rounds and overheads. Most of the points are above the y = x line in Figure B2, implying that AdaCost has lower cumulative loss in an overwhelming number of cases.

Figure B. Cumulative cost versus rounds (1); AdaBoost versus AdaCost in cumulative cost (2).

References

1. W. Fan et al., "AdaCost: Misclassification Cost-Sensitive Boosting," Proc. 16th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1999, pp. 97–105.
2. Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
3. R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Proc. 11th Conf. Computational Learning Theory, ACM Press, New York, 1998.
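To make Figure A concrete, the following Python sketch implements the AdaCost loop with the β functions above. It is our illustration, not the authors' implementation: the weak_learner callback (any learner that accepts instance weights and returns a scoring function with values in [-1, +1], as Ripper plays that role in the experiments above) and the Schapire-Singer style choice of α_t are assumptions.

import numpy as np

def beta(correct, c):
    # Cost-adjustment function from the text:
    # beta+(c) = -0.5c + 0.5 on correct predictions, beta-(c) = 0.5c + 0.5 on errors.
    return np.where(correct, -0.5 * c + 0.5, 0.5 * c + 0.5)

def adacost(X, y, c, weak_learner, T=50):
    # y[i] in {-1, +1}; c[i] is the cost factor, normalized to [0, 1].
    D = c / c.sum()                                  # D_1(i) = c_i / sum_j c_j
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                    # weak hypothesis h_t: X -> [-1, +1]
        margins = y * h(X)                           # y_i * h_t(x_i)
        b = beta(margins > 0, c)
        r = float(np.sum(D * margins * b))
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))  # one reasonable alpha_t, following
        D = D * np.exp(-alpha * margins * b)         #   Schapire and Singer's analysis
        D = D / D.sum()                              # divide by Z_t: D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    # Final hypothesis: H(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))

With the cost factors assigned as described (tranamt - overhead for frauds, overhead for nonfrauds, normalized to [0, 1]), a misclassified expensive fraud regains weight fastest, which is exactly the bias toward reducing cost that AdaCost is designed to have.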
Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier.1 Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.

Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture. However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease and eliminates the need for expensive parallel hardware. Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance. This pruning technique increases the learned detectors' computational performance and throughput.

The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class—instances that are in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the "AdaCost algorithm" sidebar.)
Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank's fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques. The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (as well as a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system2 is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.


Credit card data and cost models

Chase Bank and First Union Bank, members of the Financial Services Technology Consortium (FSTC), provided us with real credit card data for this study. The two data sets contain credit card transactions labeled as fraudulent or legitimate. Each bank supplied 500,000 records spanning one year, with a 20% fraud and 80% nonfraud distribution for Chase Bank and 15% versus 85% for First Union Bank. In practice, fraudulent transactions are much less frequent than the 15% to 20% observed in the data given to us. These data might have been cases where the banks have difficulty in determining legitimacy correctly. In some of our experiments, we deliberately create more skewed distributions to evaluate the effectiveness of our techniques under more extreme conditions.

Bank personnel developed the schemata of the databases over years of experience and continuous analysis to capture important information for fraud detection. We cannot reveal the details of the schema beyond what we have described elsewhere.2 The records of one schema have a fixed length of 137 bytes each and about 30 attributes, including the binary class label (fraudulent/legitimate transaction). Some fields are numeric and the rest categorical. Because account identification is not present in the data, we cannot group transactions into accounts. Therefore, instead of learning behavior models of individual customer accounts, we build overall models that try to differentiate legitimate transactions from fraudulent ones. Our models are customer-independent and can serve as a second line of defense, the first being customer-dependent models.

Most machine-learning literature concentrates on model accuracy (either training error or generalization error on hold-out test data, computed as overall accuracy, true-positive or false-positive rates, or return-on-cost analysis). This domain provides a considerably different metric to evaluate the learned models' performance: models are evaluated and rated by a cost model. Due to the different dollar amount of each credit card transaction and other factors, the cost of failing to detect a fraud varies with each transaction. Hence, the cost model for this domain relies on the sum and average of loss caused by fraud. We define

CumulativeCost = Σ_{i=1}^{n} Cost(i)

and

AverageCost = CumulativeCost / n

where Cost(i) is the cost associated with transaction i, and n is the total number of transactions.

After consulting with a bank representative, we jointly settled on a simplified cost model that closely reflects reality. Because it takes time and personnel to investigate a potentially fraudulent transaction, each investigation incurs an overhead. Other related costs—for example, the operational resources needed for the fraud-detection system—are consolidated into overhead. So, if the amount of a transaction is smaller than the overhead, investigating the transaction is not worthwhile even if it is suspicious. For example, if it takes $10 to investigate a potential loss of $1, it is more economical not to investigate it. Therefore, assuming a fixed overhead, we devised the cost model shown in Table 1 for each transaction, where tranamt is the amount of a credit card transaction.

Table 1. Cost model assuming a fixed overhead.

OUTCOME                           COST
Miss (false negative—FN)          tranamt
False alarm (false positive—FP)   overhead if tranamt > overhead; 0 if tranamt ≤ overhead
Hit (true positive—TP)            overhead if tranamt > overhead; tranamt if tranamt ≤ overhead
Normal (true negative—TN)         0
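Table 1 translates directly into a per-transaction cost function. The sketch below is our reading of the table, not the banks' production logic; the flag and parameter names are illustrative.

def transaction_cost(predicted_fraud, actual_fraud, tranamt, overhead):
    # Per-transaction cost under Table 1, with a fixed overhead per investigation.
    if actual_fraud and not predicted_fraud:
        return tranamt                # miss (FN): the full amount is lost
    if predicted_fraud and not actual_fraud:
        # false alarm (FP): only transactions larger than the overhead are challenged
        return overhead if tranamt > overhead else 0
    if predicted_fraud and actual_fraud:
        # hit (TP): pay the overhead; frauds at or below it go unchallenged and are lost
        return overhead if tranamt > overhead else tranamt
    return 0                          # normal (TN): no cost

def cumulative_and_average_cost(outcomes, overhead):
    # outcomes: iterable of (predicted_fraud, actual_fraud, tranamt) triples
    costs = [transaction_cost(p, a, amt, overhead) for p, a, amt in outcomes]
    return sum(costs), sum(costs) / len(costs)    # CumulativeCost, AverageCost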
The overhead threshold, for obvious reasons, is a closely guarded secret and varies over time. The range of values used here is probably a reasonable bound for this data set, although the actual overheads are probably significantly lower. We evaluated all our empirical studies using this cost model.

Skewed distributions

Given a skewed distribution, we would like to generate a training set of labeled transactions with a desired distribution, without removing any data, which maximizes classifier performance. In this domain, we found that determining the desired distribution is an experimental art: it requires extensive empirical tests to find the most effective training distribution.

In our approach, we first create data subsets with the desired distribution (determined by extensive sampling experiments). Then we generate classifiers from these subsets and combine them by metalearning from their classification behavior. For example, if the given skewed distribution is 20:80 and the desired distribution for generating the best models is 50:50, we randomly divide the majority instances into four partitions and form four data subsets by merging the minority instances with each of the four partitions containing majority instances. That is, the minority instances replicate across the four data subsets to generate the desired 50:50 distribution for each distributed training set.

For concreteness, let N be the size of the data set with a distribution of x:y (x is the percentage of the minority class), and let u:v be the desired distribution. The number of minority instances is N × x, and the desired number of majority instances in a subset is Nx × v/u. The number of subsets is the number of majority instances (N × y) divided by the number of desired majority instances in each subset, which is Ny divided by Nxv/u, or y/x × u/v. So, we have y/x × u/v subsets, each of which has Nx minority instances and Nxv/u majority instances. The next step is to apply a learning algorithm or algorithms to each subset.
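The partitioning step itself is short; the following Python sketch (ours, assuming a simple list-of-records representation, not code from the JAM system) carries it out:

import random

def desired_distribution_subsets(minority, majority, u, v, seed=0):
    # Each subset holds all Nx minority instances plus Nx * v/u majority
    # instances, giving y/x * u/v subsets with the desired u:v distribution.
    # (The last subset can come up slightly short when the division is inexact.)
    per_subset = max(1, round(len(minority) * v / u))
    rest = list(majority)
    random.Random(seed).shuffle(rest)
    return [list(minority) + rest[i:i + per_subset]
            for i in range(0, len(rest), per_subset)]

With 100,000 minority and 400,000 majority instances and u:v = 1:1, this yields the four 50:50 subsets of the example above.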
Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve speed for superlinear-time learning algorithms. The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere.1

To simplify our discussion, we describe only the class-combiner (or stacking) strategy.3 This strategy composes a metalevel training set by using the base classifiers' predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than the voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers, and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting the cost performance, making it relatively efficient in the credit card authorization process.
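A compact sketch of the class-combiner strategy follows. The original implementation lives in the Java-based JAM system; scikit-learn is our modern stand-in, with naive Bayes playing the role of the Bayes metalearner used in the experiments below, and the base classifiers assumed to be already trained on their subsets.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_class_combiner(base_classifiers, X_val, y_val):
    # Metalevel training set: the base classifiers' predictions on the
    # validation set are the attribute values; the true labels are the class.
    meta_attrs = np.column_stack([clf.predict(X_val) for clf in base_classifiers])
    metaclassifier = GaussianNB().fit(meta_attrs, y_val)

    def predict(X):
        # During online detection the base classifiers could run in
        # parallel on different processors; here we query them in turn.
        preds = np.column_stack([clf.predict(X) for clf in base_classifiers])
        return metaclassifier.predict(preds)

    return predict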


Experiments and results. To evaluate our multiclassifier class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase.4 We used transactions from the first eight months (10/95–5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Because credit card transactions have a natural two-month business cycle—the time to bill a customer is one month, followed by a one-month payment period—the true label of a transaction cannot be determined in less than two months' time. Hence, models built from data in one month cannot be rationally applied for fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results from the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month, for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here.

Furthermore, to investigate whether our approach is indeed fruitful, we ran experiments on the class-combiner strategy directly applied to the original data sets from the first eight months (that is, they have the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining.

Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers: 8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers).

Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval).

                                       OVERHEAD = $50             OVERHEAD = $75             OVERHEAD = $100
METHOD (TRAINING FRAUD %)              COST          SAVED (%/$K)  COST          SAVED (%/$K)  COST          SAVED (%/$K)
Class combiner (50%)                   17.96 ± .14   51 / 761      20.07 ± .13   46 / 676      21.87 ± .12   41 / 604
Single CART (50%)                      20.81 ± .75   44 / 647      23.64 ± .96   36 / 534      26.05 ± 1.25  30 / 437
Class combiner (given)                 22.61         39 / 575      23.99         35 / 519      25.20         32 / 471
Average single classifier (given)      27.97 ± 1.64  24 / 360      29.08 ± 1.60  21 / 315      30.02 ± 1.56  19 / 278
COTS (N/A)                             25.44         31 / 461      26.40         29 / 423      27.24         26 / 389
(We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data set because this approach increases the training-set size and is not appropriate in domains with large amounts of data—one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings—at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented.

Surprisingly, we also observe that when the overhead is $50, a classifier (single CART) trained from one month's data with the desired 50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combining also contributed to the performance improvement. Consequently, utilizing the desired training distribution together with the class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions.

Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly less savings than our techniques for the three overhead amounts we report in this table. This comparison might not be entirely accurate, because COTS has much more training data than we have, and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud—the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world.

We also evaluated our method with more skewed distributions (obtained by downsampling minority instances): 10:90, 1:99, and 1:999. As we discussed earlier, the desired distribution is not necessarily 50:50—for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS.
Neither method achieved any savings with 1:999 distributions.

To characterize the conditions under which our techniques are effective, we calculate R, the ratio of the overhead amount to the average cost: R = Overhead / AverageCost. Our approach is significantly more effective than the deployed COTS when R < 6. Neither method is effective when R > 24. So, under a reasonable cost model with a fixed overhead cost for challenging transactions as potentially fraudulent, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud; the loss due to this fraud is yet another cost of conducting business. However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors that are built from individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions carry close to no risk; hence, purchases with similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlined business operations and increased automation will also lower the ratio.

Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible.

In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences between the schemata of databases yield incompatible classifiers—that is, a classifier cannot be applied to data of a different format. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate in their systems useful information that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites A and B with databases DB_A and DB_B, having similar but not identical schemata. Without loss of generality, we assume that

Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, B_{n+1}, C}

where A_i and B_i denote the ith attribute of DB_A and DB_B, respectively, and C denotes the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that A_i = B_i, 1 ≤ i ≤ n. As for the A_{n+1} and B_{n+1} attributes, there are two possibilities:

1. A_{n+1} ≠ B_{n+1}: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, C}

where we assume that attribute B_{n+1} is not present in DB_B. (The dual problem has DB_B composed with B_{n+1}, but A_{n+1} is not available to A.)
2. A_{n+1} ≈ B_{n+1}: The two attributes are of similar type but slightly different semantics—that is, there might be a map from the domain of one type to the domain of the other. For example, A_{n+1} and B_{n+1} are fields with time-dependent information but of different duration (that is, A_{n+1} might denote the number of times an event occurred within a window of half an hour, and B_{n+1} might denote the number of times the same event occurred but within 10 minutes).

In both cases (attribute A_{n+1} is either not present in DB_B or semantically different from the corresponding B_{n+1}), the classifiers C_Aj derived from DB_A are not compatible with DB_B's data and therefore cannot be directly used at DB_B's site, and vice versa. But the purpose of using a distributed data-mining system and deploying learning agents and metalearning their classifier agents is to be able to combine information from different sources.

Bridging methods. There are two methods for handling the missing attributes.5

Method I: Learn a local model for the missing attribute and exchange. Database DB_B imports, along with the remote classifier agent, a bridging agent from database DB_A that is trained to compute values for the missing attribute A_{n+1} in DB_B's data. Next, DB_B applies the bridging agent on its own data to estimate the missing values. In many cases, however, this might not be possible or desirable for DB_A (for example, in case the attribute is proprietary). The alternative for database DB_B is to simply add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for DB_A, it is of no value to DB_B. After all, DB_B did not include it in its schema, and presumably other attributes (including the common ones) have predictive value.

Method II: Learn a local model without the missing attribute and exchange. In this approach, database DB_A can learn two local models: one with the attribute A_{n+1}, which can be used locally by the metalearning agents, and one without it, which can subsequently be exchanged. Learning a second classifier agent without the missing attribute, or with the attributes that belong to the intersection of the attributes of the two databases' data sets, implies that the second classifier uses only the attributes that are common among the participating sites, so no issue exists for its integration at other databases. But remote classifiers imported by database DB_A (and assured not to involve predictions over the missing attributes) can still be locally integrated with the original model that employs A_{n+1}. In this case, the remote classifiers simply ignore the local data set's missing attributes.
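Here is a minimal sketch of Method I, under the assumption that A_{n+1} is continuous, so a regressor can learn it from the n common attributes; the scikit-learn estimators are our stand-ins for JAM's learning and bridging agents, and the depth bound is arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_bridging_agent(X_common_A, extra_attr_A):
    # At site A: learn to predict the extra attribute A_{n+1} from the
    # n attributes both databases share.
    return DecisionTreeRegressor(max_depth=5).fit(X_common_A, extra_attr_A)

def apply_remote_classifier(classifier_A, bridging_agent, X_common_B):
    # At site B: estimate the missing attribute, append it as an extra
    # column, and only then apply the classifier imported from site A.
    estimated = bridging_agent.predict(X_common_B).reshape(-1, 1)
    return classifier_A.predict(np.hstack([X_common_B, estimated]))

For a categorical A_{n+1}, a classifier would replace the regressor. Method II needs no bridging code at all, because the exchanged model is trained only on the common attributes.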
Both approaches address the incompatible schema problem, and metalearning over these models should proceed in a straightforward manner. Compared to Zbigniew Ras's related work,6 our approach is more general because our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.

Pruning

An ensemble of classifiers can be unnecessarily complex: many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to utilize heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and at the same time achieve comparable or better predictive performance than fully grown (unpruned) metaclassifiers.

To this end, we introduced two stages for pruning metaclassifiers: the pretraining and posttraining pruning stages. Both stages are essential and complementary to each other with respect to the improvement of the system's accuracy and efficiency.

Pretraining pruning refers to the filtering of the classifiers before they are combined. Instead of combining classifiers in a brute-force manner, we introduce a preliminary stage for analyzing the available classifiers and qualifying them for inclusion in a combined metaclassifier. Only those classifiers that appear (according to one or more predefined metrics) to be most promising participate in the final metaclassifier. Here, we adopt a black-box approach that evaluates the set of classifiers based only on their input and output behavior, not their internal structure. Conversely, posttraining pruning denotes the evaluation and pruning of constituent base classifiers after a complete metaclassifier has been constructed. We have implemented and experimented with three pretraining and two posttraining pruning algorithms, each with different search heuristics.

The first pretraining pruning algorithm ranks and selects its classifiers by evaluating each candidate classifier independently (metric-based), and the second algorithm decides by examining the classifiers in correlation with each other (diversity-based). The third relies on the independent performance of the classifiers and the manner in which they predict with respect to each other and with respect to the underlying data set (coverage- and specialty-based). The first posttraining pruning algorithm is based on a cost-complexity pruning technique (a technique the CART decision-tree learning algorithm uses that seeks to minimize the cost and size of its tree while reducing the misclassification rate). The second is based on the correlation between the classifiers and the metaclassifier.7

Compared to Dragos Margineantu and Thomas Dietterich's approach,8 ours considers the more general setting where ensembles of classifiers can be obtained by applying possibly different learning algorithms over (possibly) distinct databases. Furthermore, instead of voting (as AdaBoost does) over the predictions of classifiers for the final classification, we use metalearning to combine the individual classifiers' predictions.
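As one concrete instance, the metric-based pretraining pruning algorithm reduces to ranking and truncation. This sketch is ours; evaluate is an assumed callback that scores a classifier on validation data (for example, by the cost-model savings defined earlier, via a hypothetical savings function), and the diversity-, coverage-, and cost-complexity-based variants would replace this ranking heuristic with their own.

def pretraining_prune(candidates, evaluate, k):
    # Metric-based pretraining pruning: score each candidate base classifier
    # independently on validation data, then keep only the k most promising
    # ones for the metalearning stage.
    return sorted(candidates, key=evaluate, reverse=True)[:k]

# Hypothetical usage, scoring by cost-model savings on a validation month:
# pruned = pretraining_prune(base_classifiers,
#                            lambda clf: savings(clf, X_val, y_val, amounts),
#                            k=32)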
Evaluation of knowledge sharing and pruning. First, we distributed the data sets across six different data sites (each site stores two months of data) and prepared the set of candidate base classifiers—that is, the original set of base classifiers the pruning algorithm is called to evaluate. We computed these classifiers by applying five learning algorithms (Bayes, C4.5, ID3, CART, and Ripper) to each month of data, creating 60 base classifiers (10 classifiers per data site). Next, we had each data site import the remote base classifiers (50 in total) that we subsequently used in the pruning and metalearning phases, thus ensuring that each classifier would not be tested unfairly on known data. Specifically, we had each site use half of its local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned and unpruned metaclassifier's overall performance.7 In essence, this experiment's setting corresponds to a parallel sixfold cross-validation.

Finally, we had the two simulated banks exchange their classifier agents. In addition to its 10 local and 50 internal classifiers (those imported from its peer data sites), each site also imported 60 external classifiers (from the other bank). Thus, we populated each simulated Chase data site with 60 (10 + 50) Chase classifiers and 60 First Union classifiers, and we populated each First Union site with 60 (10 + 50) First Union classifiers and 60 Chase classifiers. Again, the sites used half of their local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned or unpruned metaclassifier's overall performance. To ensure fairness, we did not use the 10 local classifiers in metalearning.

The two databases, however, had the following schema differences:

• Chase and First Union defined a (nearly identical) feature with different semantics.
• Chase includes two (continuous) features not present in the First Union data.

Table 3. Results on knowledge sharing and pruning.

                                          CHASE                 FIRST UNION
CLASSIFICATION METHOD                     SIZE  SAVINGS ($K)    SIZE  SAVINGS ($K)
COTS scoring system from Chase            N/A   682             N/A   N/A
Best base classifier over entire set      1     762             1     803
Best base classifier over one subset      1     736             1     770
Metaclassifier                            50    818             50    944
Metaclassifier                            32    870             21    943
Metaclassifier (+ First Union)            110   800             N/A   N/A
Metaclassifier (+ First Union)            63    877             N/A   N/A
Metaclassifier (+ Chase – bridging)       N/A   N/A             110   942
Metaclassifier (+ Chase + bridging)       N/A   N/A             110   963
Metaclassifier (+ Chase + bridging)       N/A   N/A             56    962

For the first incompatibility, we mapped the First Union data values to the Chase data's semantics. For the second incompatibility, we deployed bridging agents to compute the missing values.5
When predicting, the First Union classifiers simply disregarded the real values provided at the Chase data sites, while the Chase classifiers relied on both the common attributes and the predictions of the bridging agents to deliver a prediction at the First Union data sites.

Table 3 summarizes our results for the Chase and First Union banks, displaying the savings for each fraud predictor examined. The column denoted as size indicates the number of base classifiers used in the ensemble classification system. The first row of Table 3 shows the best possible performance of Chase's own COTS authorization and detection system on this data set. The next two rows present the performance of the best base classifiers over the entire set and over a single month's data, while the last four rows detail the performance of the unpruned (sizes of 50 and 110) and pruned metaclassifiers (sizes of 32 and 63). The first two of these metaclassifiers combine only internal (from Chase) base classifiers, while the last two combine both internal and external (from Chase and First Union) base classifiers. We did not use bridging agents in these experiments, because Chase defined all attributes used by the First Union classifier agents.

Table 3 records similar data for the First Union data set, with the exception of First Union's COTS authorization and detection performance (it was not made available to us), and with the additional results obtained when employing special bridging agents from Chase to compute the values of First Union's missing attributes. Most obviously, these experiments show the superior performance of metalearning over the single-model approaches and over the traditional authorization and detection systems (at least for the given data sets). The metaclassifiers outperformed the single base classifiers in every category. Moreover, by bridging the two databases, we managed to further improve the metalearning system's performance.

However, combining classifier agents from the two banks directly (without bridging) is not very effective, no doubt because the attribute missing from the First Union data set is significant in modeling the Chase data set. Hence, the First Union classifiers are not as effective as the Chase classifiers on the Chase data, and the Chase classifiers cannot perform at full strength at the First Union sites without the bridging agents.

Table 3 also shows the invaluable contribution of pruning. In all cases, pruning succeeded in computing metaclassifiers with similar or better fraud-detection capabilities, while reducing their size and thus improving their efficiency. We have provided a detailed description of the pruning methods and a comparative study of predictive performance and metaclassifier throughput elsewhere.7

One limitation of our approach to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution depends strongly on the cost model and the learning algorithm. Currently, for simplicity, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge.

Our experience in credit card fraud detection has also affected other important applications. Encouraged by our results, we shifted our attention to the growing problem of intrusion detection in network- and host-based computer systems. Here we seek to perform the same sort of task as in the credit card fraud domain: we seek to build models to distinguish between bad (intrusions or attacks) and good (normal) connections or processes. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions.9
9 Thiswork, as well as the results reported <strong>in</strong> thisarticle, demonstrates conv<strong>in</strong>c<strong>in</strong>gly that distributeddata-m<strong>in</strong><strong>in</strong>g techniques that comb<strong>in</strong>emultiple models produce effective fraud and<strong>in</strong>trusion detectors.AcknowledgmentsWe thank the past and current participants ofthis project: Charles Elkan, Wenke Lee, ShelleyTselepis, Alex Tuzhil<strong>in</strong>, and Junx<strong>in</strong> Zhang. Wealso appreciate the data and expertise provided byChase and First Union for conduct<strong>in</strong>g this study.This research was partially supported by grantsfrom DARPA (F30602-96-1-0311) and the NSF(IRI-96-32225 and CDA-96-25374).References1. P. Chan and S. Stolfo, “Metalearn<strong>in</strong>g for Multistrategyand Parallel Learn<strong>in</strong>g,” Proc. SecondInt’l Workshop Multistrategy Learn<strong>in</strong>g,Center for Artificial Intelligence, GeorgeMason Univ., Fairfax, Va., 1993, pp. 150–165.2. S. Stolfo et al., “JAM: Java Agents for Metalearn<strong>in</strong>gover <strong>Distributed</strong> <strong>Data</strong>bases,” Proc.Third Int’l Conf. Knowledge Discovery and<strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>, AAAI Press, Menlo Park,Calif., 1997, pp. 74–81.3. D. Wolpert, “Stacked Generalization,” NeuralNetworks, Vol. 5, 1992, pp. 241–259.4. P. Chan and S. Stolfo, “Toward ScalableLearn<strong>in</strong>g with Nonuniform Class and CostDistributions: A Case Study <strong>in</strong> <strong>Credit</strong> <strong>Card</strong><strong>Fraud</strong> <strong>Detection</strong>,” Proc. Fourth Int’l Conf.Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1998, pp.164–168.5. A. Prodromidis and S. Stolfo, “<strong>M<strong>in</strong><strong>in</strong>g</strong> <strong>Data</strong>baseswith Different Schemas: Integrat<strong>in</strong>gIncompatible Classifiers,” Proc. Fourth IntlConf. Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1998,pp. 314–318.6. Z. Ras, “Answer<strong>in</strong>g Nonstandard Queries <strong>in</strong><strong>Distributed</strong> Knowledge-Based Systems,”Rough Sets <strong>in</strong> Knowledge Discovery, Studies<strong>in</strong> Fuzz<strong>in</strong>ess and Soft Comput<strong>in</strong>g, L. Polkowskiand A. Skowron, eds., Physica Verlag,New York, 1998, pp. 98–108.7. A. Prodromidis and S. Stolfo, “Prun<strong>in</strong>g Metaclassifiers<strong>in</strong> a <strong>Distributed</strong> <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong> System,”Proc. First Nat’l Conf. New InformationTechnologies, Editions of New Tech.Athens, 1998, pp. 151–160.8. D. Marg<strong>in</strong>eantu and T. Dietterich, “Prun<strong>in</strong>gAdaptive Boost<strong>in</strong>g,” Proc. 14th Int’l Conf.Mach<strong>in</strong>e Learn<strong>in</strong>g, Morgan Kaufmann, SanFrancisco, 1997, pp. 211–218.9. W. Lee, S. Stolfo, and K. Mok, “<strong>M<strong>in</strong><strong>in</strong>g</strong> <strong>in</strong> a<strong>Data</strong>-Flow Environment: Experience <strong>in</strong> NetworkIntrusion <strong>Detection</strong>,” Proc. Fifth Int’lConf. Knowledge Discovery and <strong>Data</strong> <strong>M<strong>in</strong><strong>in</strong>g</strong>,AAAI Press, Menlo Park, Calif., 1999,pp. 114–124.Philip K. 
Chan is an assistant professor of computerscience at the Florida Institute of Technology.His research <strong>in</strong>terests <strong>in</strong>clude scalable adaptivemethods, mach<strong>in</strong>e learn<strong>in</strong>g, data m<strong>in</strong><strong>in</strong>g,distributed and parallel comput<strong>in</strong>g, and <strong>in</strong>telligentsystems. He received his PhD, MS, and BS <strong>in</strong> computerscience from Columbia University, VanderbiltUniversity, and Southwest Texas State University,respectively. Contact him at ComputerScience, Florida Tech, Melbourne, FL 32901;pkc@cs.fit. edu; www.cs.fit.edu/~pkc.Wei Fan is a PhD candidate <strong>in</strong> computer science atColumbia University. His research <strong>in</strong>terests are <strong>in</strong>mach<strong>in</strong>e learn<strong>in</strong>g, data m<strong>in</strong><strong>in</strong>g, distributed systems,<strong>in</strong>formation retrieval, and collaborative filter<strong>in</strong>g. Hereceived his MSc and B.Eng from Ts<strong>in</strong>ghua Universityand MPhil from Columbia University. Contacthim at the Computer Science Dept., ColumbiaUniv., New York, NY 10027; wfan@cs.columbia.edu;www.cs.columbia.edu/~wfan.Andreas Prodromidis is a director <strong>in</strong> research anddevelopment at iPrivacy. His research <strong>in</strong>terests<strong>in</strong>clude data m<strong>in</strong><strong>in</strong>g, mach<strong>in</strong>e learn<strong>in</strong>g, electroniccommerce, and distributed systems. He received aPhD <strong>in</strong> computer science from Columbia Universityand a Diploma <strong>in</strong> electrical eng<strong>in</strong>eer<strong>in</strong>g fromthe National Technical University of Athens. He isa member of the IEEE and AAAI. Contact him atiPrivacy, 599 Lex<strong>in</strong>gton Ave., #2300, New York,NY 10022; andreas@iprivacy.com; www.cs.columbia. edu/~andreas.Salvatore J. Stolfo is a professor of computer scienceat Columbia University and codirector of theUSC/ISI and Columbia Center for AppliedResearch <strong>in</strong> Digital Government Information Systems.His most recent research has focused on distributeddata-m<strong>in</strong><strong>in</strong>g systems with applications tofraud and <strong>in</strong>trusion detection <strong>in</strong> network <strong>in</strong>formationsystems. He received his PhD from NYU’sCourant Institute. Contact him at the ComputerScience Dept., Columbia Univ., New York, NY10027; sal@cs.columbia.edu; www.cs.columbia.edu/sal; www.cs.columbia.edu/~sal.74 IEEE INTELLIGENT SYSTEMS
