E. Carrizosa, B. Martín-Barragán, F. Plastria, and D. Romero-Morales

… and not nominal or ordinal. Moreover, no blanks are allowed, which excludes its direct use for cases in which some measures are missing or simply do not apply. See e.g. Cristianini and Shawe-Taylor [5], Freed and Glover [9], Gehrlein [10], Gochet et al. [12] and Mangasarian [20]. A more flexible methodology, which only requires knowledge of a metric (or, more generally, a dissimilarity), is the Nearest Neighbor (NN) method [4, 6, 7, 14], which provides excellent results, as documented e.g. in [16].

In Nearest Neighbor methods, for each new entry i, the distances (or dissimilarities) d(i, j) to some individuals j in the database (called prototypes) are computed, and i is classified according to this set of distances. In particular, in the classical NN rule [4], all individuals are prototypes, and i is classified as a member of the class c* to which its closest prototype j* (satisfying d(i, j*) ≤ d(i, j) for all j) belongs.

A generalization of the NN is the k-NN, e.g. [7], which classifies each i into the class most frequently found among the k prototypes closest to i. In particular, the NN is the k-NN for k = 1.

These classification rules, however, require distances to be computed to all data in the database for each new entry, which demands large storage and time resources and makes on-line queries impractical.

For these reasons, several variants have been proposed over the last three decades; see e.g. [1, 2, 6, 7, 11, 13, 17, 18] and the references therein. For instance, Hart [13] suggests the Condensed Nearest Neighbor (CNN) rule, in which the full database J is replaced by a certain subset I, namely a so-called minimal consistent subset: a subset of records such that, if the NN rule is used with I (instead of J) as the set of prototypes, all points in J are classified in the correct classes. Since such a minimal consistent subset can still be too large, several procedures have been suggested to reduce its size. Although these procedures do not necessarily classify all items in the database correctly (i.e., they are not consistent), they may predict class membership of future entries as well as or even better than the CNN rule, because they may reduce the overfitting from which the CNN rule may suffer; see e.g. [3, 19].

In this talk we propose a new model, in which a set of prototypes I ⊂ J of prespecified cardinality p is sought, minimizing an empirical misclassification cost. Hence, if p is taken greater than or equal to the cardinality p* of a minimal consistent subset, then all individuals in the training sample will be correctly classified, yielding an empirical misclassification cost of zero. On the other hand, for p < p* we allow some data in the training sample to be incorrectly classified, in the hope of reducing the possible overfitting which the classifier based on a minimal consistent subset might cause. Additionally, the effort needed to classify a new entry is directly proportional to p, which may therefore serve in practice to guide the choice of an upper bound on p.
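To make the objective concrete, the following is a minimal sketch of how the empirical misclassification cost of a candidate prototype set might be evaluated under the 1-NN rule. It assumes unit misclassification costs and a Euclidean metric, and the function and variable names are illustrative only; the model in the talk admits general dissimilarities and costs.

```python
import numpy as np

def empirical_misclassification_cost(X, y, prototypes, dist=None):
    """Cost of the 1-NN rule that uses only the given prototype indices.

    X: (n, d) array of feature vectors; y: (n,) array of class labels.
    Assumes a unit cost per misclassified training point (an illustrative
    simplification; the model allows general misclassification costs).
    """
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)  # assumed metric
    cost = 0
    for i in range(len(X)):
        # Closest prototype j*: d(i, j*) <= d(i, j) for all prototypes j.
        j_star = min(prototypes, key=lambda j: dist(X[i], X[j]))
        if y[j_star] != y[i]:
            cost += 1
    return cost
```

The proposed model then amounts to minimizing this quantity over all subsets I ⊂ J with |I| = p; a consistent subset is precisely one whose cost is zero.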
For simplicity we restrict ourselves to the classification rule based on the closest distance, which can hence be seen as a variant of the NN rule. However, the results developed here extend directly to the case in which the k closest distances, k ≥ 1, are considered in the classification procedure, leading to a variant of the k-NN method.

The talk is structured as follows. First, the mathematical model is introduced and shown to be NP-hard. Two Integer Programming formulations are then proposed and theoretically compared. Numerical results are given, showing that, when the optimization problems are solved exactly (with a standard MIP solver), the classification rules for p < p* perform at least as well as the CNN rule, but with enormous preprocessing times. For this reason, a heuristic procedure is also proposed, and its quality and speed are explored. It is shown that the rules obtained with this heuristic procedure behave on testing samples similarly to the optimal ones.
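As a final illustration of the rule to which the results extend, here is a hedged sketch of classification by majority vote among the k nearest prototypes; with k = 1 it reduces to the NN rule sketched above. The tie-breaking policy and all names are assumptions of this sketch, not the authors' specification.

```python
from collections import Counter
import numpy as np

def knn_classify(x_new, X, y, prototypes, k=1, dist=None):
    """Assign x_new the most frequent class among its k nearest prototypes.

    With k = 1 this is exactly the closest-distance (NN) rule; ties are
    broken arbitrarily here (an assumption of this sketch).
    """
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)  # assumed metric
    nearest = sorted(prototypes, key=lambda j: dist(x_new, X[j]))[:k]
    votes = Counter(y[j] for j in nearest)
    return votes.most_common(1)[0][0]
```

Note that the work per query grows linearly with the number of prototypes, which is why bounding the cardinality p of the prototype set directly bounds the on-line classification effort.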
