
A Method for Detecting Discontinuous Probability Density Function from Data

Ioan V. Lemeni

University of Craiova, Craiova, Romania (Tel: 0251-435666; e-mail: ioan.lemeni@cs.ucv.ro)

Abstract: In real classification applications the patterns of different classes often overlap. In this situation the most appropriate classifier is the one whose outputs represent the class conditional probabilities. In traditional statistics these probabilities are calculated indirectly, in two steps: first the underlying prior probability densities are estimated and then the Bayes rule is applied. Popular methods for density estimation are the Parzen window and the Gaussian mixture. It is also possible to calculate the class conditional probabilities directly, using logistic regression, the k-Nearest Neighbours algorithm or a Multilayer Perceptron artificial neural network. Many methods, direct or indirect, perform poorly when the underlying prior probability densities are discontinuous along the support's border. This paper presents a method for detecting the discontinuity by analyzing samples drawn from the underlying density. Knowing that the densities are discontinuous helps to choose an estimator insensitive to discontinuities.

Keywords: probability density function estimation, class conditional probability estimation

1. INTRODUCTION

In statistical pattern recognition, classification means assigning an object or a fact to a predefined class. The object or fact is represented by a subset of its attributes, say d of them. Supposing that the attributes are numerical or can be converted to numbers, each object's representation becomes a point, or a vector, in R^d. The classifier has to distinguish between classes, trying to isolate homogeneous regions. A region is considered homogeneous if it contains vectors belonging to one class only.

Unfortunately, homogeneous regions are rare; most of the time there is a degree of overlap between classes. This overlap is due to the fact that some essential attributes were not recorded. Financial data sets are well known for their high degree of overlap. If the data set exhibits such overlap, the best classifier is the one whose outputs represent the a posteriori conditional probabilities or, in other words, the class conditional probabilities.

There are two approaches to calculating the class conditional probabilities. According to the first, these probabilities can be calculated using the well known Bayes rule from the statistics field:

P(\omega_k \mid x) = \frac{p(x \mid \omega_k)\, P(\omega_k)}{p(x)} = \frac{p(x \mid \omega_k)\, P(\omega_k)}{\sum_i p(x \mid \omega_i)\, P(\omega_i)}    (1)

In (1), P(ω_k) is the probability of class ω_k, p(x) is the probability density function (pdf) of the feature vector x, and p(x|ω_k) is the conditional probability density of x in class ω_k, known a priori. If we choose to follow the traditional statistical approach in order to calculate the class conditional probability, first we have to estimate the probability density of x in every class. Two well known methods used to estimate the probability density function from data are the Parzen window and the Gaussian Mixture Model (GMM).
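As a concrete illustration of the indirect route through (1), the following is a minimal NumPy/SciPy sketch. The two Gaussian class-conditional densities and the prior values are assumptions chosen only for the example, not quantities taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical one-dimensional setting with two classes whose
# class-conditional densities p(x | omega_k) are assumed Gaussian.
priors = np.array([0.6, 0.4])   # P(omega_k), assumed
means = np.array([0.0, 2.0])    # assumed class means
stds = np.array([1.0, 1.0])     # assumed class spreads

def posterior(x):
    """P(omega_k | x) via the Bayes rule (1)."""
    likelihoods = norm.pdf(x, loc=means, scale=stds)  # p(x | omega_k)
    joint = likelihoods * priors                      # p(x | omega_k) P(omega_k)
    return joint / joint.sum()                        # divide by p(x)

# At x = 1, equidistant from both means, the likelihoods cancel and
# the posterior falls back to the priors: [0.6, 0.4].
print(posterior(1.0))
```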
The second approach relies on a special form of the error function. Ruck et al. (1990) showed that the outputs of a Multilayer Perceptron (MLP), when trained as a classifier using backpropagation, approximate the posterior conditional probabilities. This finding is not specific to the MLP; it holds true for any classifier that uses the sum-of-squares or the likelihood as its error function. The logistic regression and k-Nearest Neighbours algorithms also fall into this category.

The performance of many classifiers deteriorates when the underlying prior probability densities p(x|ω_k) are not fully supported in R^d and, additionally, these densities are not continuous along the support's frontier. For example, the standard uniform distribution is supported in the interval [0, 1] only and is discontinuous at 0 and 1; the exponential distribution's support is [0, ∞), making it discontinuous at 0.

The discontinuity does not affect all classifiers to the same degree: some classifiers are more affected than others, and some are not affected at all. If the closed form of the prior densities is known, it is easy to say whether they are discontinuous and to choose a less sensitive classifier. But in real life situations the densities must be estimated from input data. Consequently, the closed form is unknown, so we cannot say whether they are discontinuous or not.


The remainder of this paper is organized as follows. Section 2 presents the effect of the discontinuity on density estimation and Section 3 presents a method that detects the discontinuity by analyzing samples drawn from that discontinuous density.

2. THE EFFECT OF THE DISCONTINUITY

One widely used method for density estimation is the Parzen window. Let x_1, x_2, ..., x_n be n independent and identically distributed samples drawn from some distribution with an unknown density f. Density f can be estimated as

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)    (2)

where K(·) is the kernel and h > 0 is the smoothing parameter or bandwidth. This kind of estimator is called a kernel density estimator or Parzen window. The kernel is a symmetric function that integrates to one. A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov and normal.

However, the estimator does not take into account the potentially finite support of the feature vector x. When the support of some of its components is bounded, for example in the case of nonnegative data, the standard kernel estimator gives weight outside the support. This causes a bias in the boundary region. In order to outline the Parzen window behaviour for a discontinuous density, we consider the standard exponential density. We drew 1000 points and then estimated the density using the Parzen window. The result is shown in Fig. 1.

Fig. 1. Exponential density estimated from 1000 input vectors by a Parzen estimator.

The boundary bias problem of the standard kernel is well documented in the one-dimensional case, and many solutions have been proposed. In the one-dimensional case the random variable is either positive, being discontinuous at zero, or is restricted to the (0, 1) interval, being discontinuous at 0 and 1. An initial solution to the boundary problem is given by Schuster (1985), who proposes to generate data outside the support by reflection. This solution is simple and easy to implement, but it only works when the first derivative of the generator density is zero to the right or to the left of the discontinuity. As this requirement is rarely fulfilled, Cowling and Hall (1996) also generated new data outside the support, but their location is now determined by interpolation, not by reflection. A different solution, proposed by Marron and Ruppert (1994), consists in transforming the input data such that the discontinuity or discontinuities disappear. The function g(·) transforms the data such that g(0) and possibly g(1) are zero. Müller (1991) and many others suggest the use of adaptive boundary kernels at the edges and a fixed standard kernel in the interior region. More recently, Bouezmarni and Rombouts (2006) studied gamma kernels for univariate nonnegative data.

The boundary bias problem becomes more severe in the multivariate case because the boundary region grows with the dimension of the support and becomes more complex. In the one-dimensional case the frontier is a point and can be easily detected, but in the multidimensional case it becomes a hypersurface. In this case any solution presented so far can be adapted only if the closed form of the boundary is known.
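The boundary bias and Schuster's reflection fix can be reproduced with a short sketch. It uses SciPy's gaussian_kde as the Parzen estimator (2) with a normal kernel; the sample size mirrors the experiment behind Fig. 1, while the library, bandwidth rule and seed are my own assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)   # standard exponential: true f(0) = 1

# Standard Parzen window (2) with a normal kernel: part of each
# kernel's mass falls below 0, outside the support, so the estimate
# at the boundary sits around half the true value.
kde = gaussian_kde(x)
print("standard estimate at 0:", kde(0.0)[0])   # well below 1

# Schuster (1985): reflect the sample about the boundary. Evaluating
# the KDE of the augmented sample and doubling it folds the leaked
# mass back into [0, inf). Because the exponential's derivative at 0
# is not zero, some bias remains, but the estimate moves much closer
# to the true boundary value.
kde_refl = gaussian_kde(np.concatenate([x, -x]))
print("reflected estimate at 0:", 2 * kde_refl(0.0)[0])   # near 1
```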
In real life this is rarely the case, so the Parzen estimator must be avoided if the data are bounded and the boundary is not known.

Another popular method in probability density estimation is the Gaussian Mixture Model (GMM). The probability density function of the observed data is represented as a superposition of m multivariate Gaussian pdfs

p_{GM}(x) = \sum_{j=1}^{m} \rho_j\, N(x, \mu_j, C_j)    (3)

where x is a d-dimensional observation from the data set, ρ_j (j = 1, ..., m) are the component weights, summing up to unity, \sum_j \rho_j = 1, and N(x, μ_j, C_j) is a multivariate normal density with mean vector μ_j and covariance matrix C_j. The negative log-likelihood of a data set made of n observations, given by

E = -\ln L = -\sum_{i=1}^{n} \ln p(x_i) = -\sum_{i=1}^{n} \ln\!\left\{ \sum_{j=1}^{m} \rho_j\, N(x_i, \mu_j, C_j) \right\}    (4)

can be used as an error function. Its minimization through the expectation-maximization (EM) algorithm yields the optimal values for the GMM's parameters ρ_j, μ_j and C_j. A review of mixture models, maximum likelihood and the EM algorithm has been given by Redner and Walker (1984).

GMM gives very good results if the densities to be estimated are continuous. It can be shown that any continuous pdf can be approximated arbitrarily closely by a Gaussian mixture density. But little work has been done in the field of discontinuous densities, with only two papers addressing this problem.
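As a concrete instance of fitting (3) by minimizing (4), the sketch below uses scikit-learn's GaussianMixture, whose fit method runs EM. The exponential sample and the five components anticipate the experiment reported next; the library choice and seed are my own assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.exponential(size=1000).reshape(-1, 1)

# Fit the 5-component mixture (3) by EM, i.e. minimize the negative
# log-likelihood (4) over the weights, means and covariances.
gmm = GaussianMixture(n_components=5, random_state=0).fit(x)

# score_samples returns ln p_GM(x); exponentiate to get the density
# and compare it with the true density exp(-x). The fit leaks mass
# below 0 and misbehaves near the discontinuity at the origin.
grid = np.linspace(0.0, 4.0, 9).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))
for t, f in zip(grid.ravel(), density):
    print(f"x = {t:4.1f}   estimate = {f:.3f}   true = {np.exp(-t):.3f}")
```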


Hedelin and Skoglund (2000), developing a method for high-rate vector quantization in speech coding, found that GMM performance is degraded when the density to be estimated has bounded support. In order to outline the GMM behaviour for a discontinuous density, we consider the example used for the Parzen window. The same 1000 points were used to estimate the exponential density using a GMM with 5 components. The result is shown in Fig. 2.

Fig. 2. Exponential density estimated from 1000 input vectors by a GMM with 5 components.

Hedelin and Skoglund proposed a version of the EM algorithm for such densities and named the new algorithm EMBS. Although the EMBS algorithm provides an interesting solution to the bounded support issue, it cannot be used when the analytical form of the support is not known.

Another author who emphasizes the poor performance of GMM for discontinuous densities is Likas. In Likas (2001) the author developed a method for pdf estimation based on a multilayer perceptron neural network and compared it against the traditional GMM. The densities estimated exhibit discontinuities that affect the performance of GMM.

A detailed analysis of GMM versus MLP performance for discontinuous densities can be found in Lemeni (2009). The paper outlines the oscillating behaviour of GMM due to the discontinuity through one- and two-dimensional examples, and the good performance of MLP in the same examples. The author claims that MLP is superior to GMM for class conditional probability estimation when the underlying priors are discontinuous.

Densities that are discontinuous by definition, such as the exponential and the uniform densities, as well as truncated densities, were used to assess the performance of two other methods: logistic regression and k-Nearest Neighbours (kNN). The results showed that the logistic regression is not affected by the discontinuity but kNN is.

Because some estimators are severely affected in the vicinity of the discontinuity, we need a method that is able to analyse samples drawn from an unknown density and tell whether that density is discontinuous or not.

3. THE MASS CENTER METHOD FOR DISCONTINUITY DETECTION

In many real life situations, the densities to be estimated from input data originate from fully supported densities which have been truncated. Such a situation is common in the financial field.

Financial data are constrained, those constraints being either an effect of a law or specific to a particular company. For instance, quantities such as the height or weight of a person are normally distributed with no restriction. On the other hand, other quantities, such as the gross salary, must obey a legal constraint, namely the minimum living wage.

Fig. 3. Minimum living wage histogram in Romania, year 2005.

Fig. 3 presents the histogram of the minimum living wage in Romania, year 2005, taken from a study of the Group of Applied Economics (2008) on the impact of the flat income tax. The distribution's shape seems to be normal, except for the missing left side of the graph. The truncation is the effect of the minimum living wage law and makes the distribution discontinuous at this value. The minimum living wage was 330 RON in 2005.

Banks often use such quantities, e.g. gross salary and age, for classification and regression. If a bank accepted all the applications for a credit with no restriction, the salary and age histograms would have the shape in Fig. 3, because there cannot be salaries less than the minimum living wage or clients younger than 18 years.
Generally, if the histogram of an attribute describing a real object, fact or event has the shape in Fig. 3, this is a clear indication of a discontinuous underlying density.

The constraints considered so far are generated by the current legislation, and financial institutions must obey them. But there is a second type of constraint, this time self-imposed. For example, a bank trying to lend money only to non-default customers will accept an application only if a certain criterion is met. Such a criterion is the credit score of the applicant. Most of the time the credit score is calculated with Fisher's linear discriminant or one of its variants. The credit-score criterion will act as a knife, cutting the data space, so there will be records only for borrowers with a credit score greater than a certain threshold. In the attribute space the credit-score criterion acts as a separating hypersurface: on one side there are no vectors, while many vectors lie on the other side in the vicinity of that hypersurface. Moreover, in real applications the analytical form of the separating hypersurface may not be known. In many cases we do not even realize that the support has a border and that the pdf to be estimated is discontinuous along this border. As will be shown shortly, the attributes' histograms are useless in situations like these.
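How thoroughly a one-dimensional histogram hides such a cut can be checked with a few lines of NumPy. The parameters below anticipate the two-attribute example developed next (they are taken from that example, not additional data), and the ASCII histogram is only a stand-in for a plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Potential customers: 2-D Gaussian with mean (1, 1) and identity
# covariance. Accepted customers: only those passing the linear
# score criterion nsalary + nage >= 2, whose analytical form the
# analyst would normally not know.
potential = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=4000)
accepted = potential[potential.sum(axis=1) >= 2.0]

# The marginal histogram of either attribute is unimodal and roughly
# bell-shaped; nothing betrays the sharp cut along nsalary + nage = 2.
counts, edges = np.histogram(accepted[:, 0], bins=20)
for c, e in zip(counts, edges):
    print(f"{e:6.2f}  {'#' * (c // 10)}")
```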


As before, let us suppose that the potential customers of a bank are described by two attributes, salary and age. After normalization they are Gaussian distributed with mean μ = (1, 1) and covariance matrix Σ = I. Let us also suppose that the bank's credit score criterion is nsalary + nage ≥ 2, where the leading n stands for "normalized". This criterion creates the distribution of the accepted customers. Its support consists of all the points in R^2 for which the score criterion is fulfilled. Thus the distribution of accepted customers is obtained from the distribution of potential customers by truncation along the line of equation nsalary + nage = 2. This line is the support's frontier. In Fig. 4, 4000 random vectors were drawn from the distribution of the potential customers and only those fulfilling the score criterion were plotted.

Fig. 4. Example of truncated data.

The data layout is typical for a financial institution that applied a score criterion. Hereinafter we want to determine how this type of truncation is reflected in the attributes' histograms. Fig. 5 presents the histogram of the nsalary attribute after truncation.

Fig. 5. Histogram of the nsalary attribute after truncation.

The histogram of nage has the same shape and is not shown. The histogram rather suggests a normal distribution and offers no indication of the discontinuity at the support's border. For the two-dimensional example presented so far it is easy to detect the discontinuity if we build a scatter plot like the one presented in Fig. 4. Unfortunately this approach is almost impossible for three dimensions and impossible for four dimensions or more.

For more than two dimensions the discontinuity can be detected if we analyze the spatial distribution of the k neighbors of an input vector. The analysis must be conducted for all available input vectors. All those k neighbors lie in a d-sphere centered at the input vector, with radius given by the distance between that input vector and the farthest neighbor. If the number k of neighbors is chosen so that the generator density is constant in the vicinity of the input vector under analysis, and every neighbor is seen as a small particle with one unit of mass, then the mass center of all neighbors will overlap the geometric center of the d-sphere surrounding them. This situation is depicted in Fig. 6a.

Fig. 6. Position of the neighbors' mass center relative to their geometric center: (a) density locally uniform, (b) input vector on the support's border.

On the other hand, if the input vector under analysis lies exactly on the support's border and, additionally, the support's border can be approximated by a hyperplane, then the neighbors will form a homogeneous hemi d-sphere with radius R. The position of the hemi d-sphere's mass center will no longer coincide with the geometric center. For d = 2 the mass center lies 4R/(3π) units away from the geometric center, and for d = 3, 3R/8 units. Fig. 6b presents the position of the mass center for d = 2.

In real conditions the neighbors' density is not perfectly uniform, so a 5% deviation of the mass center's position from the position corresponding to the uniform distribution is accepted.

Considering the example from Fig. 4, the input vectors with the mass center placed 4R/(3π) ± 5% away from the geometric center are represented as black squares in Fig. 7.
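A minimal sketch of this mass-center test follows, assuming a KD-tree for the neighbor search. The helper name on_border_count is hypothetical, the regenerated sample stands in for the data of Fig. 4, and the neighbor count and 5% tolerance follow the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def on_border_count(data, k=40, tol=0.05):
    """Count vectors whose neighbor mass center sits at the hemi
    d-sphere offset, i.e. vectors flagged as lying on the border."""
    d = data.shape[1]
    # Offset of a homogeneous hemi d-sphere's mass center from its
    # geometric center, as a fraction of R (values for d = 2 and
    # d = 3 from the text).
    expected = {2: 4 / (3 * np.pi), 3: 3 / 8}[d]
    tree = cKDTree(data)
    flagged = 0
    for x in data:
        dist, idx = tree.query(x, k=k + 1)   # nearest hit is x itself
        R = dist[-1]                          # radius of the d-sphere
        mass_center = data[idx[1:]].mean(axis=0)
        offset = np.linalg.norm(mass_center - x)
        if abs(offset - expected * R) <= tol * expected * R:
            flagged += 1
    return flagged

# Truncated sample as in Fig. 4: 2-D Gaussian cut by nsalary + nage >= 2.
rng = np.random.default_rng(0)
potential = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=4000)
accepted = potential[potential.sum(axis=1) >= 2.0]
print(on_border_count(accepted, k=40))
```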
The radius R is chosen such that the d-sphere (a circle for d = 2) around the input vector under analysis contains 40 neighbors. The optimal number of neighbors will be discussed later.

Analyzing the distribution of the black squares in the figure, we notice two categories of input vectors: input vectors lying on the support's frontier, such as v1, and vectors lying in low density zones, such as v2. The mass center criterion correctly indicates the "on border" position for the vectors belonging to the first category and fails for the second. The category can be decided by the radius of the neighbors' circle: an analysis of the radius of the


Now, the graph of the number of vectors lying on the border as a function of the number of neighbors has an L shape.

Fig. 9. Vectors lying on the border as a function of the number of neighbors.

Similar results have been obtained for three-dimensional data. The generator density used in this example was N(μ = (1, 1, 1), Σ = I) and the truncation criterion was x_1 + x_2 + x_3 > 3. The graphs of the number of vectors lying on the border as a function of the number of neighbors, for both the truncated and the non-truncated densities, are presented in Fig. 9b. We notice a remarkable similarity with the graphs corresponding to the two-dimensional case.

The next test has been carried out on real data originating from the credit database of CEC Bank. The records from the database were preprocessed and then used to create a data set consisting of 7502 four-dimensional vectors. This data set was tested for discontinuity by the mass center method. The graph of the number of vectors lying on the border as a function of the number of neighbors is shown in Fig. 10. The graph has a V shape and the number of "on border" vectors is never zero. These two characteristics are a clear indication that the CEC data set is discontinuous.

Fig. 10. Vectors lying on the border as a function of the number of neighbors, for CEC Bank credit records.
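Tracing this diagnostic curve is a simple sweep over the number of neighbors. The usage sketch below reuses the hypothetical on_border_count helper from the earlier sketch, with fresh truncated and non-truncated samples standing in for the paper's data; the L-versus-V interpretation in the comments restates the text's claim, not a property verified here.

```python
import numpy as np

# Requires on_border_count from the earlier mass-center sketch.
# Per the text: a fully supported density yields a curve that falls
# to zero (L shape), while a truncated one yields a V shape whose
# count never reaches zero.
rng = np.random.default_rng(1)
full = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=4000)
truncated = full[full.sum(axis=1) >= 2.0]

for k in (10, 20, 40, 80, 160):
    print(k, on_border_count(truncated, k=k), on_border_count(full, k=k))
```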
4. CONCLUSIONS

Many data sets, especially those originating from the financial field, are truncated due to some legal restriction or a self-imposed criterion. The discontinuity along the support's border affects the performance of estimators such as the Parzen window, GMM or kNN, while other estimators, such as MLP or logistic regression, are less affected.

This paper presented the mass center method for detecting the discontinuity. Knowing that the densities are discontinuous helps to choose an estimator insensitive to discontinuities. The performance of the mass center method was tested in many runs on artificial data drawn from different densities, with different dimensions, truncation criteria and sizes. The method was always able to indicate the presence of the discontinuity.

REFERENCES

Bouezmarni, T. and Rombouts, J. (2006). "Nonparametric Density Estimation for Positive Time Series," CORE discussion paper 2006/85.

Group of Applied Economics (2008). Available online at: http://www.gea.org.ro/documente/ro/studii/cota_unica/cota_unica_ppt.ppt

Hedelin, P. and Skoglund, J. (2000). "Vector quantization based on Gaussian mixture models," IEEE Trans. Speech Audio Processing, vol. 8, no. 4, pp. 385-401.

Lemeni, I. (2009). "Multilayer Perceptron versus Gaussian Mixture for Class Probability Estimation with Discontinuous Underlying Prior Densities," in Proc. IEEE Computing in the Global Information Technology, ICCGI '09, pp. 240-245.

Likas, A. (2001). "Probability Density Estimation Using Artificial Neural Networks," Computer Physics Communications, vol. 135, no. 2, pp. 167-175.

Marron, J. and Ruppert, D. (1994). "Transformations to Reduce Boundary Bias in Kernel Density Estimation," Journal of the Royal Statistical Society, Series B, 56, 653-671.

Müller, H. (1991). "Smooth Optimum Kernel Estimators near Endpoints," Biometrika, 78, 521-530.

Redner, R. A. and Walker, H. F. (1984). "Mixture densities, maximum likelihood and the EM algorithm," SIAM Rev., vol. 26, no. 2, pp. 195-239.

Ruck, D. W., Rogers, S. K., et al. (1990). "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Trans. Neural Networks, vol. 1, pp. 296-298.

Schuster, E. (1985). "Incorporating Support Constraints into Nonparametric Estimators of Densities," Communications in Statistics - Theory and Methods, 14, 1123-1136.
