Distributed Data Mining in Credit Card Fraud Detection

The AdaCost algorithm

One of the most important results of our experimental work on this domain was the realization that each of the base-learning algorithms employed in our experiments utilized internal heuristics based on training accuracy, not cost. This led us to investigate new algorithms that employ internal metrics of misclassification cost when computing hypotheses to predict fraud.

Here, we describe AdaCost¹ (a variant of AdaBoost²,³), which reduces both fixed and variable misclassification costs more significantly than AdaBoost. We follow the generalized analysis of AdaBoost by Robert Schapire and Yoram Singer.³ Figure A shows the algorithm. Let $S = ((x_1, c_1, y_1), \ldots, (x_m, c_m, y_m))$ be a sequence of training examples where each instance $x_i$ belongs to a domain $X$, each cost factor $c_i$ belongs to the nonnegative real domain $R^+$, and each label $y_i$ belongs to a finite label space $Y$. We focus only on binary classification problems in which $Y = \{-1, +1\}$. $h$ is a weak hypothesis of the form $h: X \to R$. The sign of $h(x)$ is interpreted as the predicted label, and the magnitude $|h(x)|$ is the “confidence” in this prediction. Let $t$ be the index of the boosting round and $D_t(i)$ be the weight given to $(x_i, c_i, y_i)$ at the $t$th round, with $0 \le D_t(i) \le 1$ and $\sum_i D_t(i) = 1$. $\alpha_t$ is the parameter chosen as the weight for weak hypothesis $h_t$ at the $t$th round. We assume $\alpha_t > 0$. $\beta(\mathrm{sign}(y_i h_t(x_i)), c_i)$ is a cost-adjustment function with two arguments: $\mathrm{sign}(y_i h_t(x_i))$, which shows whether $h_t(x_i)$ is correct, and the cost factor $c_i$.

The difference between AdaCost and AdaBoost is the additional cost-adjustment function $\beta(\mathrm{sign}(y_i h_t(x_i)), c_i)$ in the weight-updating rule.

Figure A. AdaCost.

Given: $(x_1, c_1, y_1), \ldots, (x_m, c_m, y_m)$: $x_i \in X$, $c_i \in R^+$, $y_i \in \{-1, +1\}$
Initialize $D_1(i)$ (such as $D_1(i) = c_i / \sum_{j=1}^{m} c_j$)
For $t = 1, \ldots, T$:
1. Train weak learner using distribution $D_t$.
2. Compute weak hypothesis $h_t: X \to R$.
3. Choose $\alpha_t \in R$ and $\beta(i) \in R^+$.
4. Update
$$D_{t+1}(i) = \frac{D_t(i)\,\exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\, \beta(\mathrm{sign}(y_i h_t(x_i)), c_i)\bigr)}{Z_t}$$
where $\beta(\mathrm{sign}(y_i h_t(x_i)), c_i)$ is a cost-adjustment function. $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a distribution.
Output the final hypothesis:
$$H(x) = \mathrm{sign}(f(x)) \quad \text{where} \quad f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

Where it is clear in context, we use either $\beta(i)$ or $\beta(c_i)$ as shorthand for $\beta(\mathrm{sign}(y_i h_t(x_i)), c_i)$. Furthermore, we use $\beta_+$ when $\mathrm{sign}(y_i h_t(x_i)) = +1$ and $\beta_-$ when $\mathrm{sign}(y_i h_t(x_i)) = -1$. For an instance with a higher cost factor, $\beta(i)$ increases its weight “more” if the instance is misclassified, but decreases its weight “less” otherwise. Therefore, we require $\beta_-(c_i)$ to be nondecreasing with respect to $c_i$, $\beta_+(c_i)$ to be nonincreasing, and both to be nonnegative. We proved that AdaCost reduces cost on the training data.¹

Logically, we can assign a cost factor $c$ of tranamt – overhead to frauds and a factor $c$ of overhead to nonfrauds. This reflects how the prediction errors will add to the total cost of a hypothesis. Because the actual overhead is a closely guarded trade secret and is unknown to us, we chose to set $\text{overhead} \in \{60, 70, 80, 90\}$ to run four sets of experiments. We normalized each $c_i$ to $[0, 1]$. The cost-adjustment function $\beta$ is chosen as $\beta_-(c) = 0.5 \cdot c + 0.5$ and $\beta_+(c) = -0.5 \cdot c + 0.5$.

As in previous experiments, we use training data from one month and data from two months later for testing. Our data set let us form 10 such pairs of training and test sets. We ran both AdaBoost and AdaCost to the 50th round. We used Ripper as the “weak” learner because it provides an easy way to change the training set’s distribution. Because using the training set alone usually overestimates a rule set’s accuracy, we used the Laplace estimate to generate the confidence for each rule.

We are interested in comparing Ripper (as a baseline), AdaCost, and AdaBoost in several dimensions.
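The weight update in Figure A, together with the cost-adjustment functions β– and β+ chosen above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the example data, and the value of alpha are hypothetical, the cost factors are assumed to be normalized to [0, 1] already, and the choice of alpha in each round is not shown.

```python
import math


def beta_minus(c):
    """Cost adjustment for a misclassified example: 0.5*c + 0.5."""
    return 0.5 * c + 0.5


def beta_plus(c):
    """Cost adjustment for a correctly classified example: -0.5*c + 0.5."""
    return -0.5 * c + 0.5


def initial_weights(costs):
    """D_1(i) = c_i / sum_j c_j, the initialization suggested in Figure A."""
    total = sum(costs)
    return [c / total for c in costs]


def adacost_update(weights, costs, labels, predictions, alpha):
    """One AdaCost round: D_{t+1}(i) = D_t(i) exp(-alpha y_i h(x_i) beta(i)) / Z_t.

    labels are -1 or +1; predictions are the real-valued outputs h_t(x_i).
    """
    unnormalized = []
    for d, c, y, h in zip(weights, costs, labels, predictions):
        margin = y * h
        # A zero margin makes the exponent zero, so the choice of beta is moot.
        beta = beta_plus(c) if margin > 0 else beta_minus(c)
        unnormalized.append(d * math.exp(-alpha * margin * beta))
    z = sum(unnormalized)  # normalization factor Z_t
    return [u / z for u in unnormalized]


def final_hypothesis(alphas, hypotheses, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    f = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    return 1 if f >= 0 else -1


if __name__ == "__main__":
    # Hypothetical toy data: four examples with normalized cost factors.
    costs = [1.0, 0.2, 0.2, 0.6]
    labels = [+1, -1, -1, +1]
    predictions = [-1.0, -1.0, +1.0, +1.0]  # examples 0 and 2 are misclassified
    d1 = initial_weights(costs)
    d2 = adacost_update(d1, costs, labels, predictions, alpha=0.5)
    print(d1, d2)
```

In this toy run, the misclassified high-cost example gains the most weight, while correctly classified examples lose weight, and the high-cost one among them loses proportionally less.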
First, for each data set and cost model, we determine which algorithm has achieved the lowest cumulative misclassification cost. We wish to know in how many cases AdaCost is the clear winner. Second, we also seek to know, quantitatively, the difference in cumulative misclassification cost of AdaCost from AdaBoost and the baseline Ripper. It is interesting to measure the significance of these differences in terms of both reduction in misclassification loss and percentage of reduction. Finally, we are interested to know if AdaCost requires more computing power.

Figure B plots the results from the Chase Bank’s credit card data. Figure B1 shows the average reduction of 10 months in percentage cumulative loss (defined as cumulativeloss / (maximalloss – leastloss) * 100%) for AdaBoost and AdaCost for all 50 rounds with an overhead of $60 (results for other overheads are in other work¹). We can clearly see that there is a consistent reduction. The absolute amount of reduction is around 3%.

We also observe that the speed of reduction by AdaCost is quicker than that of AdaBoost. The speed is the highest in the first few rounds. This means that in practice, we might not need to run AdaCost for many rounds. Figure B2 plots the ratio of cumulative cost by AdaCost and AdaBoost. We have plotted the results of all 10 pairs of training and test months over all rounds and overheads. Most of the points are above the y = x line in Figure B2, implying that AdaCost has lower cumulative loss in an overwhelming number of cases.

Figure B. Cumulative cost versus rounds (1); AdaBoost versus AdaCost in cumulative cost (2). Panel 1 plots percentage cumulative misclassification cost against boosting rounds for AdaBoost and AdaCost (overhead = 60). Panel 2 plots AdaBoost cumulative cost against AdaCost cumulative cost, with the line y = x for reference.

References
1. W. Fan et al., “AdaCost: Misclassification Cost-Sensitive Boosting,” Proc. 16th Int’l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1999, pp. 97–105.
2. Y. Freund and R. Schapire, “Experiments with a New Boosting Algorithm,” Proc. 13th Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
3. R. Schapire and Y. Singer, “Improved Boosting Algorithms Using Confidence-Rated Predictions,” Proc. 11th Conf. Computational Learning Theory, ACM Press, New York, 1998.

transactions are legitimate, although this is equivalent to not detecting fraud at all.
• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.

Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers’ behavior to generate a metaclassifier.¹ Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.

Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture. However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease and eliminates the need for expensive parallel hardware.

68 IEEE INTELLIGENT SYSTEMS

Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance.
This pruning technique increases the learned detectors’ computational performance and throughput.

The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class: instances that are in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the “AdaCost algorithm” sidebar.) Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank’s fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques.
The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (as well as a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.²

NOVEMBER/DECEMBER 1999 69
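The partition, train, and combine pipeline described in this article can be illustrated with a toy Python sketch. It is not JAM code and does not reflect JAM's API. The base learner (a single threshold on transaction amount), the table-based metaclassifier, and all function names and data are hypothetical stand-ins chosen to keep the example self-contained. Each base classifier is treated as a black box, and the metaclassifier is learned only from the base classifiers' predictions on a separate validation set.

```python
from collections import Counter, defaultdict


def partition(records, k):
    """Split the labeled records into k disjoint subsets (round-robin)."""
    return [records[i::k] for i in range(k)]


def train_stump(subset):
    """Toy base learner (assumes a non-empty subset).

    Picks the amount threshold with the fewest training errors and
    predicts fraud (1) when amount >= threshold, else legitimate (0).
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({amount for amount, _ in subset}):
        errors = sum((amount >= threshold) != bool(label)
                     for amount, label in subset)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda amount, t=best_threshold: int(amount >= t)


def train_metaclassifier(base_classifiers, validation):
    """Learn a mapping from the tuple of base predictions to a final label."""
    votes = defaultdict(Counter)
    for amount, label in validation:
        key = tuple(clf(amount) for clf in base_classifiers)
        votes[key][label] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    def metaclassifier(amount):
        key = tuple(clf(amount) for clf in base_classifiers)
        if key in table:
            return table[key]
        # Unseen combination of base predictions: fall back to majority vote.
        return int(sum(key) * 2 > len(key))

    return metaclassifier


if __name__ == "__main__":
    # Hypothetical (amount, label) records; label 1 marks fraud.
    train = [(20, 0), (35, 0), (50, 0), (400, 1), (15, 0),
             (520, 1), (60, 0), (610, 1), (45, 0)]
    validation = [(30, 0), (450, 1), (700, 1), (55, 0), (560, 1)]
    # Each subset could be mined on a separate machine; here it is sequential.
    base = [train_stump(s) for s in partition(train, 3)]
    meta = train_metaclassifier(base, validation)
    print([meta(a) for a in (25, 450, 800)])
```

Because the metaclassifier sees only predictions, any learning program could replace the toy stump without changing the combining step, which is the property the black-box design relies on.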