Model Selection Based on the Modulus of Continuity
the trade-off between the under-fitting and over-fitting of regression models. Here, the important issue is measuring the model complexity related to the variance term. The statistical approach to model selection is to use a penalty (correction) factor as the measure of model complexity. Well-known criteria are the Akaike information criterion (AIC) [1], the Bayesian information criterion (BIC) [2], generalized cross-validation (GCV) [3], minimum description length (MDL) [4, 5], and the risk inflation criterion (RIC) [6]. These methods fit linear regression models easily when a large number of samples is available. However, they have difficulty selecting the optimal structure of the estimation network in the case of nonlinear regression models and/or a small number of samples. Recently, Vapnik [7] proposed a model selection method based on the structural risk minimization (SRM) principle. One characteristic of this method is that the model complexity is described by structural information such as the VC dimension of the estimation network.

This method can be applied to nonlinear models and also to regression models using a small number of samples. Chapelle et al. [8] and Cherkassky et al. [9] showed that SRM-based model selection is able to outperform other statistical methods such as AIC or BIC. These methods require the actual VC dimension of the hypothesis space associated with the estimation network, which is not easy to determine in the case of nonlinear regression models.

In this paper, we present risk function bounds in the sense of the modulus of continuity (MC), a measure of the continuity of a given function. Lorentz [10] applied the MC to function approximation theory. In our method, this measure is applied to determine the risk function bounds in the $L_1$ sense. For the estimation of these bounds, the MC is analyzed for both the target and estimation functions. As a result, the suggested bounds include both structural information, such as the number of parameters, and learned results, such as the weight values of the estimation network.

Through the simulation of function approximation, we have shown that the MC based method can provide the better


Example: Consider the trigonometric polynomials $\phi_0(x) = 1/2$, $\phi_{2j-1}(x) = \sin jx$, and $\phi_{2j}(x) = \cos jx$ for $j = 1, 2, \cdots, n$. Applying the mean value theorem to $\phi_k$, we get
$$\omega(\phi_k, h) \le \|\phi_k'\|_\infty h \le n \cdot h \quad \text{for } k = 0, \cdots, 2n.$$
That is,
$$\max_{0 \le k \le 2n} \{\omega(\phi_k, h)\} \le n \cdot h.$$
Therefore, the upper bounds of risk functions can be determined by
$$R(f_n)_{L_1} \le R_{emp}(f_n)_{L_1} + \omega(f, h) + n \cdot h \sum_{k=0}^{2n} |w_k| + \sigma\sqrt{\frac{2}{\delta}}. \tag{24}$$
Similarly, we can derive the upper bounds of risk functions for the regression model using algebraic polynomials, that is, $f_n(x) = \sum_{i=0}^{n} w_i x^i$.

Example: Consider the following form of radial basis functions:
$$\phi_k(x) = \exp\left(-\frac{(x - t_k)^2}{2\sigma_k^2}\right) \quad \text{for } k = 1, \cdots, n$$
where $t_k$ and $\sigma_k^2$ represent the center and width of the basis function $\phi_k$ respectively. Applying the mean value theorem to $\phi_k$, we get
$$\omega(\phi_k, h) \le \|\phi_k'\|_\infty h \le \frac{h}{\sigma_k} \exp\left(-\frac{1}{2}\right) \quad \text{for } k = 1, \cdots, n$$
since $\phi_k'$ has its maximum and minimum at $x = t_k - \sigma_k$ and $x = t_k + \sigma_k$ respectively. This implies that
$$\max_{1 \le k \le n} \{\omega(\phi_k, h)\} \le \frac{h}{\sigma'} \exp\left(-\frac{1}{2}\right)$$
where $\sigma' = \min_{1 \le k \le n} \sigma_k$.


Therefore, the upper bound of the risk function can be determined by
$$R(f_n)_{L_1} \le R_{emp}(f_n)_{L_1} + \omega(f, h) + \frac{h}{\sigma'} \exp\left(-\frac{1}{2}\right) \sum_{k=1}^{n} |w_k| + \sigma\sqrt{\frac{2}{\delta}}. \tag{25}$$

Example: Consider the following form of sigmoid functions:
$$\phi_k(x) = \frac{1}{1 + \exp(-a_k x + b_k)} \quad \text{for } k = 1, \cdots, n$$
where $a_k$ and $b_k$ represent the adaptable parameters of the basis function $\phi_k$. Applying the mean value theorem to $\phi_k$, we get
$$\omega(\phi_k, h) \le \|\phi_k'\|_\infty h \le \frac{h}{4} \quad \text{for } k = 1, \cdots, n$$
since $\phi_k'$ has its maximum at $x = b_k/a_k$. That is,
$$\max_{1 \le k \le n} \{\omega(\phi_k, h)\} \le \frac{h}{4}.$$
Therefore, the upper bound of the risk functions can be determined by
$$R(f_n)_{L_1} \le R_{emp}(f_n)_{L_1} + \omega(f, h) + \frac{h}{4} \sum_{k=1}^{n} |w_k| + \sigma\sqrt{\frac{2}{\delta}}. \tag{26}$$

The suggested theorem of true risk bounds is derived in the sense of the $L_1$ measure, that is, the risk function described by (16). Other model selection criteria such as AIC and SEB are derived from the risk function using the quadratic loss function $L(y, f_n(x)) = (y - f_n(x))^2$. In this context, we can show that the optimal function $f_n^*(x)$ determined from the risk function of the $L_1$ measure is also optimal in the sense of the risk function using the quadratic loss function, under the assumptions that the estimation function is unbiased and the error term $y - f_n(x)$ has a normal distribution.
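As a rough numerical illustration, the penalty term $\max_k \omega(\phi_k, h) \sum_k |w_k|$ appearing in bounds (24)–(26) can be evaluated directly from a fitted model's weights. The helper below is our own sketch, not code from the paper; the function name and argument layout are assumptions.

```python
import numpy as np

def mc_penalty(w, h, basis="trig", n=None, widths=None):
    """Worst-case modulus-of-continuity penalty: max_k ω(φ_k, h) · Σ|w_k|.

    `basis` selects the bound on max_k ω(φ_k, h) derived above:
      trig:    n · h                   (eq. 24)
      rbf:     (h / σ') · e^{-1/2}     (eq. 25), σ' = min_k σ_k
      sigmoid: h / 4                   (eq. 26)
    """
    s = float(np.sum(np.abs(w)))
    if basis == "trig":
        return n * h * s
    if basis == "rbf":
        return (h / float(np.min(widths))) * np.exp(-0.5) * s
    if basis == "sigmoid":
        return (h / 4.0) * s
    raise ValueError(f"unknown basis: {basis}")
```

For a fixed mesh size $h$, only this penalty and the empirical risk vary with the model order, which is what the simulation in the next section exploits.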


4 Simulation

The simulation of function approximation was performed using the regression model with trigonometric polynomial functions. In this regression, various model selection methods such as the AIC, BIC, VC dimension, and MC based methods were tested. As the VC dimension based method, we chose the smallest eigenvalue based (SEB) method [8] suggested by Chapelle et al. Here, the trigonometric polynomial functions were selected as the basis functions since this network can be considered a linear regression model, so the VC dimension of the network is easy to find and the optimization of the network parameters is easily solved. This makes it easier to compare the suggested method with other methods.

In this simulation, the data $(x_i, y_i)$ were generated by (1). Here, the input values $x_i$ were uniformly distributed within the interval $[-\pi, \pi]$. As the target functions, the following functions were used:

1) the step function defined by
$$f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise}, \end{cases} \tag{27}$$

2) the sine square function defined by
$$f(x) = \sin^2(x), \tag{28}$$

3) the sinc function defined by
$$f(x) = \frac{\sin(\pi x)}{\pi x}, \tag{29}$$

4) and the combined function defined by
$$f(x) = 2x \exp\left(-\frac{x^2}{2}\right) \cos(2\pi x). \tag{30}$$

The target functions are illustrated in Figure 1. The first target function is discontinuous while the other functions are continuous. The second and third functions show similar complexity, while the fourth function is more complex than the others. For these target functions, we generated $N (= 50)$ training samples randomly according to the target function to train the estimation network. We also generated 300 test samples separately.

For the estimation network, the basis functions of the trigonometric polynomial network were given by
$$\phi_0(x) = \frac{1}{2}, \quad \phi_{2j-1}(x) = \sin jx, \quad \text{and} \quad \phi_{2j}(x) = \cos jx,$$
where $j$ represents the period of the sinusoidal function. Here, the estimation function $f_n(x)$ was given by
$$f_n(x) = \sum_{k=0}^{n} w_k \phi_k(x). \tag{31}$$
For $N$ samples, the observation vector defined by $\mathbf{y} = (y_1, \cdots, y_N)^T$ can be approximated by the following vector form:
$$\mathbf{y} = \Phi_n \mathbf{w} \tag{32}$$
where $\Phi_n$ is a matrix whose $ij$-th element is given by $\phi_j(x_i)$ and $\mathbf{w}$ is a weight vector defined by $\mathbf{w} = (w_0, \cdots, w_n)^T$. From the empirical risk minimization of the square loss function, the estimated weight vector $\hat{\mathbf{w}}$ can be determined by
$$\hat{\mathbf{w}} = (\Phi_n^T \Phi_n)^{-1} \Phi_n^T \mathbf{y}. \tag{33}$$
By substituting the estimated weight vector into (32), we obtained the empirical risk $R_{emp}(f_n)$ evaluated on the training samples, and the estimated risk $\hat{R}(f_n)$ was determined by the AIC, BIC, SEB, and MC based methods. Here, the estimated optimal number of nodes was determined by
$$\hat{n} = \arg\min_n \hat{R}(f_n). \tag{34}$$
Note that in the MC method, only the terms $R_{emp}(f_n)$ and the modulus of continuity of the estimation function were considered in selecting the optimal number of nodes, since the other terms in (19) are constant once the training samples are given. To compare the risk


functions obtained for the estimated optimal number of nodes $\hat{n}$ with the risk functions for the minimum number of nodes obtained from the test samples, we computed the log ratio of the two risks, that is,
$$r_R = \log \frac{R(f_{\hat{n}})}{\min_n R(f_n)} \tag{35}$$
where $R(f_n)$ represents the risk function for the squared error loss function $L(y, f_n) = (y - f_n(x))^2$. This risk ratio represents the distance between the optimal and the estimated optimal risks. We also computed the log ratio of the estimated optimal number of nodes $\hat{n}$ to the minimum number of nodes obtained from the test patterns, that is,
$$r_n = \log \frac{\hat{n}}{\arg\min_n R(f_n)}. \tag{36}$$
This node ratio represents the distance between the optimal and the estimated optimal complexity of the network. After all experiments were repeated 1000 times, the risk ratios of (35) and the node ratios of (36) were plotted using the box-plot method. The simulation results of model selection using the AIC, BIC, SEB, and MC based methods are illustrated in Figures 2 through 9. These simulation results show that 1) the SEB based method outperformed the AIC and BIC based methods from the viewpoint of risk ratios in the case of the first, second, and third target functions, but not in the case of the fourth (that is, more complicated) function, while 2) the MC method demonstrated top-level performance from the viewpoints of both risk and node ratios for all four target functions. In general, the SEB method showed good performance when the ratio of the optimal number of nodes to the number of samples, $n^*/l$, was small, since the true risk bounds based on the VC dimension were derived in the sense of uniform convergence, that is, the worst case in the hypothesis space. As an example, we illustrate $n^*/l$ for the four target functions in Figure 10. As expected, the fourth target function required a high $n^*/l$ compared to the other target functions due to the complexity of the function, as shown in Figure 1. In this case, the SEB method did not show good performance.


To see the risk prediction of the model selection methods, we plotted the estimated error versus the number of nodes for the AIC, BIC, and SEB methods in the $L_2$ sense of the loss function, that is, the square root of the risk function, and also for the MC method in the $L_1$ sense of the loss function defined by (16). The predicted results were compared with the test errors as shown in Figures 11 and 12. These results show that 1) the risk prediction using the AIC and BIC methods fits well except for the sudden change in the risk functions, 2) the risk prediction using the SEB method tends to fit well when the ratio of the number of nodes to the number of samples $n/l$ is small, and 3) the risk prediction using the MC method fits the test errors well over the whole range of the number of nodes. In these prediction results, it is interesting that the MC method was able to catch the sudden change of test errors while the other methods did not.

In summary, the MC based method showed better performance for various types of target functions from the viewpoints of risk and node ratios compared to other methods. We also demonstrated that the risk prediction using the MC method was able to catch the trend of test errors. This is mainly due to the fact that the MC based method uses risk function bounds incorporating the information of learned results, such as the sum of absolute weights, as well as the structural information of the estimation network. Furthermore, the suggested MC method can be easily extended to various types of regression models with nonlinear kernel functions which have some smoothness constraints.


[Figure 1] The target functions for the simulation of model selection: (a) step, (b) $\sin^2(x)$, (c) $\mathrm{sinc}(x)$, (d) $2x\exp(-x^2/2)\cos(2\pi x)$.


[Figure 2] The box-plots of risk ratios $r_R$ for the regression of the step function: (a), (b), (c), and (d) represent the box-plots of $r_R$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 3] The box-plots of node ratios $r_n$ for the regression of the step function: (a), (b), (c), and (d) represent the box-plots of $r_n$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 4] The box-plots of risk ratios $r_R$ for the regression of the sine-square function: (a), (b), (c), and (d) represent the box-plots of $r_R$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 5] The box-plots of node ratios $r_n$ for the regression of the sine-square function: (a), (b), (c), and (d) represent the box-plots of $r_n$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 6] The box-plots of risk ratios $r_R$ for the regression of the sinc function: (a), (b), (c), and (d) represent the box-plots of $r_R$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 7] The box-plots of node ratios $r_n$ for the regression of the sinc function: (a), (b), (c), and (d) represent the box-plots of $r_n$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 8] The box-plots of risk ratios $r_R$ for the regression of the combined function: (a), (b), (c), and (d) represent the box-plots of $r_R$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 9] The box-plots of node ratios $r_n$ for the regression of the combined function: (a), (b), (c), and (d) represent the box-plots of $r_n$ with $\sigma$ = 0, 0.025, 0.05, and 0.1 respectively.


[Figure 10] The box-plots of the ratio of the optimal number of nodes to the number of samples $n^*/l$: (a), (b), (c), and (d) represent the box-plots of $n^*/l$ for the regression of the step, sine-square, sinc, and combined functions respectively.


[Figure 11] Performance predictions of model selection methods in the case of the sinc function: (a) represents the estimated error in the $L_2$ sense versus the number of nodes using the AIC, BIC, and SEB methods, and (b) represents the estimated error in the $L_1$ sense versus the number of nodes using the MC method.


[Figure 12] Performance predictions of model selection methods in the case of the combined function: (a) represents the estimated error in the $L_2$ sense versus the number of nodes using the AIC, BIC, and SEB methods, and (b) represents the estimated error in the $L_1$ sense versus the number of nodes using the MC method.


5 Conclusion

We have suggested a new method of model selection in regression problems based on the modulus of continuity. The risk function bounds are investigated from the viewpoint of the modulus of continuity for both the target and estimation functions. The final form of the risk function bounds incorporates the information of learned results, such as the sum of absolute weights, as well as structural information, such as the number of nodes in the regression model. To verify the validity of the suggested bound, model selection of regression models using trigonometric polynomials was performed on function approximation problems. As a result, the suggested method showed better performance for various types of target functions from the viewpoints of risk and node ratios compared to other model selection methods. We also demonstrated that the risk prediction using the MC method was able to catch the trend of test errors. This is mainly due to the fact that the MC based method uses risk function bounds incorporating the information of learned results as well as the structural information of the estimation network, the main criteria of other model selection methods. Furthermore, the suggested MC method can be easily extended to various types of regression models with nonlinear kernel functions which have some smoothness constraints.


References

1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In Petrov, B. and Csaki, F., editors, 2nd International Symposium on Information Theory (1973), 267–281, Budapest.
2. Schwartz, G.: Estimating the dimension of a model. Ann. Stat., 6 (1978), 461–464.
3. Wahba, G., Golub, G., and Heath, M.: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21 (1979), 215–223.
4. Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics, 14 (1986), 1080–1100.
5. Barron, A., Rissanen, J., and Yu, B.: The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44 (1998), 2743–2760.
6. Foster, D.P. and George, E.I.: The risk inflation criterion for multiple regression. Annals of Statistics, 22 (1994), 1947–1975.
7. Vapnik, V.N.: Statistical Learning Theory. J. Wiley, 1998.
8. Chapelle, O., Vapnik, V., and Bengio, Y.: Model selection for small sample regression. Machine Learning, 48 (2002), 315–333, Kluwer Academic Publishers.
9. Cherkassky, V., Shao, X., Mulier, F.M., and Vapnik, V.N.: Model complexity control for regression using VC generalization bounds. IEEE Transactions on Neural Networks, Vol. 10, No. 5 (1999), 1075–1089.
10. Lorentz, G.G.: Approximation of Functions. Chelsea Publishing Company, New York, 1986.


Appendix

A-1. Proof of Theorem 1

The main idea of proving this theorem is to decompose the risk functional $R(f_n)_{L_1}$ into four components: the modulus of continuity of the target function $\omega(f, h)$, the modulus of continuity of the approximation function $\omega(f_n, h)$, the additive noise, and the empirical risk functional $|y_i - f_n(x_i)|$. First, we prove the theorem for a noiseless target function, that is, $y = f(x)$. For each $x \in X$, there is $x_i \in X_i$, $i = 1, \cdots, N$, with $|x - x_i| < h$. This implies that
$$|f(x) - f_n(x)| \le |f(x) - f(x_i)| + |f(x_i) - f_n(x_i)| + |f_n(x) - f_n(x_i)| \le \omega(f, h) + |f(x_i) - f_n(x_i)| + |f_n(x) - f_n(x_i)|. \tag{37}$$
Let $f_n(x) = \sum_{i=1}^{n} w_i \phi_i(x)$. For each $x, y \in X$ with $|x - y| \le h$, we have
$$|f_n(x) - f_n(y)| = \left|\sum_{i=1}^{n} w_i (\phi_i(x) - \phi_i(y))\right| \le \max_{1 \le i \le n} |\phi_i(x) - \phi_i(y)| \sum_{i=1}^{n} |w_i| \le \max_{1 \le i \le n} \{\omega(\phi_i, h)\} \sum_{i=1}^{n} |w_i|. \tag{38}$$
That is,
$$|f(x) - f_n(x)| \le \omega(f, h) + |f(x_i) - f_n(x_i)| + \max_{1 \le i \le n} \{\omega(\phi_i, h)\} \sum_{i=1}^{n} |w_i| \tag{39}$$
from (37) and (38). By taking the integral of (39) over $X_i$ satisfying $P(X_i) = 1/N$, for $i = 1, \cdots, N$, we get
$$\int_{X_i} |f(x) - f_n(x)| \, dP(x) \le \frac{1}{N}\left(\omega(f, h) + |f(x_i) - f_n(x_i)| + \max_i \{\omega(\phi_i, h)\} \sum_{i=1}^{n} |w_i|\right). \tag{40}$$
By taking the integral of (39) over $X$, we get
$$R(f_n)_{L_1} = \int_X |f(x) - f_n(x)| \, dP(x) = \sum_{i=1}^{N} \int_{X_i} |f(x) - f_n(x)| \, dP(x) \le R_{emp}(f_n)_{L_1} + \omega(f, h) + \max_i \{\omega(\phi_i, h)\} \sum_{i=1}^{n} |w_i|. \tag{41}$$
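Inequality (38), the Lipschitz-type bound on the estimator through the moduli of its basis functions, can be checked numerically for a small trigonometric model. This is an illustrative experiment of ours, not part of the proof; the model size, mesh, and weights are arbitrary choices.

```python
import numpy as np

# Check |f_n(x) − f_n(y)| ≤ max_i ω(φ_i, h) · Σ|w_i| on random pairs |x − y| ≤ h,
# using the trigonometric-basis bound max_i ω(φ_i, h) ≤ n·h from eq. (24).
rng = np.random.default_rng(2)
n, h = 3, 0.05
w = rng.standard_normal(2 * n + 1)

def f_n(x):
    basis = [0.5] + [t for j in range(1, n + 1) for t in (np.sin(j * x), np.cos(j * x))]
    return float(np.dot(w, basis))

lip = n * h * np.sum(np.abs(w))      # right-hand side of (38) with the bound of (24)
for _ in range(1000):
    x = rng.uniform(-np.pi, np.pi)
    y = x + rng.uniform(-h, h)
    assert abs(f_n(x) - f_n(y)) <= lip + 1e-12   # (38) holds on every sampled pair
```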


Now, let us consider the target function with noise, that is, $y = f(x) + \epsilon$, where $\epsilon$ represents a random variable with mean zero and variance $\sigma^2$. Then, it follows that
$$|y - f_n(x)| \le |y - y_i| + |y_i - f_n(x_i)| + |f_n(x_i) - f_n(x)| \le \omega(f, h) + |\epsilon - \epsilon_i| + |y_i - f_n(x_i)| + |f_n(x_i) - f_n(x)|. \tag{42}$$
Here, $\epsilon_i$, $i = 1, \cdots, N$, are random variables with mean zero and variance $\sigma^2$. We also assume that $\epsilon$ and $\epsilon_i$ are independent. In this case,
$$P\{|\epsilon - \epsilon_i| > u\} \le \frac{E|\epsilon - \epsilon_i|^2}{u^2} \le \frac{2\sigma^2}{u^2} \quad \text{(by Markov's inequality).} \tag{43}$$
Therefore, with probability at least $1 - \delta$, the following inequality holds:
$$R(f_n)_{L_1} = \int_{X \times R} |y - f_n(x)| \, dP(x, \epsilon) \le R_{emp}(f_n)_{L_1} + \omega(f, h) + \max_i \{\omega(\phi_i, h)\} \sum_{i=1}^{n} |w_i| + \sigma\sqrt{\frac{2}{\delta}}. \tag{44}$$
Q.E.D.
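The tail bound (43) is easy to sanity-check numerically. The sketch below draws Gaussian noise, which is an assumption for the experiment only; the proof itself needs just zero mean, finite variance, and independence.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, u, M = 1.0, 2.5, 200_000
eps = sigma * rng.standard_normal(M)            # ε   ~ mean 0, variance σ²
eps_i = sigma * rng.standard_normal(M)          # ε_i, independent of ε
empirical = np.mean(np.abs(eps - eps_i) > u)    # Monte Carlo estimate of P{|ε − ε_i| > u}
bound = 2 * sigma**2 / u**2                     # right-hand side of (43)
assert empirical <= bound                       # the bound holds (here by a wide margin)
```

For Gaussian noise the bound is loose, as Markov-type bounds typically are, but it is distribution-free, which is exactly what the theorem requires.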
