Variance Reduction with Monte Carlo Estimates of Error Rates in Multivariate Classification

C. Weihs*                          G. Calzolari
Fachbereich Statistik              Dipartimento di Statistica
Universität Dortmund               Università degli Studi di Firenze

M. C. Röhl
Königstein/Ts.

August 1999

* D-44221 Dortmund, Germany, Tel. ++49-231-755-4363, e-mail: weihs@amadeus.statistik.uni-dortmund.de

Abstract

In this paper, control variates are proposed to speed up Monte Carlo simulations to estimate expected error rates in multivariate classification.

KEY WORDS: classification, control variates, error rate, Monte Carlo simulation, variance reduction

1 Introduction

The aim of this paper is to speed up Monte Carlo simulations applied to multivariate classification. The most interesting performance measure in classification is the misclassification error.

In the case of given group densities, there are two possibilities to calculate the error rate: numerical integration, or Monte Carlo simulation, which is the only feasible method in higher dimensions. In this paper, we focus on the Monte Carlo error estimate. This approach suffers from the variability of the error rates, because the error rate is now a random variable. Therefore, every principle to reduce this variance is welcome. In the literature various variance reduction techniques have been proposed, among them antithetic variables and control variates (see, e.g., [1]). Here,
we will concentrate on control variates and demonstrate their variance reduction potential in our special problem.

The paper is organized as follows: In section 2 we give a brief introduction to multivariate classification. In section 3 we propose two different control variates, which are studied and compared in section 4 by means of some examples. The paper closes with a conclusion in section 5.

2 Multivariate Classification

Classification deals with the allocation of objects to g predetermined groups 1, 2, ..., g, say. The aim is to minimize the misclassification error (rate) over all possible future allocations, given the group densities p_i(x) (i = 1, 2, ..., g). The minimal error rate is the so-called Bayes error.

We measure d features (variables) of the objects which we consider important for discrimination between the objects. These can be continuous features (GNP, consumption, etc.) or discrete ones (number of firms, number of inhabitants, etc.).

Once the group densities are specified, in order to minimize the error rate we allocate an object with feature vector x to group i if

    p_i(x) > p_j(x)   (j ≠ i).                                        (1)

Classification methods often assume the group densities p_i(x) to be normal. Then there are at least two modelling possibilities (see, e.g., [2]): estimate the same covariance matrix for all groups (LDA, linear discriminant analysis) or estimate a different covariance matrix for each group (QDA, quadratic discriminant analysis). Of course, both methods also estimate different mean vectors for each group. In this paper we take QDA as the adequate, and thus standard, classification procedure.

Often we additionally want to reduce the dimension from d to d' = 1 or 2 to enhance human perception (dimension reduction). The construction of a d'-space with minimal error rate, given the group densities p_i(x) in d-space, can be done by modern optimization techniques, for example Simulated Annealing [3]. In each optimization step, a projection space is proposed. Then we determine the group densities (either estimated by means of the projected data or directly derived from the projected densities of the original space) [4], and calculate the error rate in the projection space.
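Before turning to variance reduction, a minimal illustration of allocation rule (1) with fully specified densities may help; this sketch (Python) is not from the paper, and the group parameters are merely examples:

    import numpy as np
    from scipy.stats import multivariate_normal

    def allocate(x, mus, covs):
        """Allocate feature vector x to the group with the highest density,
        following rule (1); equal a-priori probabilities are assumed."""
        dens = [multivariate_normal(m, c).pdf(x) for m, c in zip(mus, covs)]
        return int(np.argmax(dens))

    # two hypothetical groups in d = 2 dimensions
    mus = [np.zeros(2), np.array([2.0, 0.0])]
    covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
    print(allocate(np.array([1.2, 0.3]), mus, covs))  # -> 1, i.e. the second group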
In this paper, we suppose that the projection space is fixed, so that we already have the group densities available. Of course, the following approach can also be applied during optimization, at each optimization step.

3 Variance Reduction by Control Variates

3.1 General Ideas

What we have to calculate is the error rate given the group densities. In one dimension, this can easily be done by numerical integration, because we only have to find the intersection points of the different group densities (determined by p_i(x) = p_j(x)) and then calculate integrals like

    ∫_a^b p_i(x) dx,   a, b ∈ R,                                      (2)

where p_i(x) denotes an arbitrary known group density.

But in two or more dimensions, the borderlines between the group densities do not have such simple shapes, even when we assume equal group covariance matrices. Therefore, integration can only be done by means of a grid in two or more dimensions.

Another possibility to calculate the error rate is Monte Carlo simulation. We generate random realizations from the group densities and allocate them according to our classification rule (1). This approach suffers from the variability of the error rates, because the error rate is now a random variable.

In order to reduce the Monte Carlo variance of the error rate we introduce control variates (cv). The object of interest is the error rate error. We write this in a more complicated but helpful way as

    error = error_cv + (error − error_cv)                             (3)

with a new random variable error_cv. We want to compute the expectation of these error rates:

    E(error) = E(error_cv) + E(error − error_cv).                     (4)

The idea behind control variates is to choose a random variable error_cv such that we can calculate E(error_cv) exactly (no variance) and error and error_cv are positively correlated. So a sensible way of estimating E(error) would be

    Ê_cv(error) = E(error_cv) + Ê(error − error_cv),                  (5)
where the first term on the right hand side has no variance and the second term is computed as the sample mean of Monte Carlo replicates. Then the variance of the right hand side of (5) is

    Var(error − error_cv)/N,                                          (6)

where N is the sample size used to determine error and error_cv, and

    Var(error − error_cv) = Var(error) + Var(error_cv) − 2 Cov(error, error_cv).   (7)

Now it becomes clear that a large positive correlation between error and error_cv can reduce the variance compared to the "naive" estimator Ê_MC(error), i.e. the sample mean of Monte Carlo replicates of error with variance Var(Ê_MC(error)) = Var(error)/N. We can even do better. We can use the equation

    E(error) = α E(error_cv) + E(error − α error_cv)                  (8)

to select the parameter α that minimizes the variance

    Var(error − α error_cv),                                          (9)

leading to

    α = Cov(error, error_cv) / Var(error_cv),                         (10)

which is almost equal to the correlation coefficient ρ when Var(error) ≈ Var(error_cv) holds. The final result is then

    min_α Var(error − α error_cv) = (1 − ρ²) Var(error),              (11)

i.e. there can always be a gain when ρ ≠ 0.

Considering the above arguments, what we look for as a control variate procedure is any classification method which gives results as highly correlated with QDA as possible, and for which the exact expected error rate is easily computable. Moreover, one should avoid control variates for which the additional computational effort is so high that the overall computation time increases even in the case of variance reduction.

3.2 Two Specific Control Variates

What is, then, a suitable control variate in our context? We will discuss two possibilities. In both cases the control variate procedure assumes a somewhat simplified problem situation to be true in order to simplify the Monte Carlo procedure. In the first procedure, the covariance matrices of the different groups are assumed to be identical. In the second procedure, the possibly high dimensional problem is optimally projected to one dimension. Note that we assume the problems under study to have normal group densities with individual covariance matrices. Thus, QDA is taken to be the standard classification method.
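To make the estimator concrete, the following sketch (Python; not the authors' code, and the names are ours) applies (5), (8) and (10) to W replicates of a pair of error rates, given an exactly known expectation of the control variate; the closing example uses purely synthetic, positively correlated numbers:

    import numpy as np

    def cv_estimate(err, err_cv, e_cv_exact):
        """Control variate estimate of E(error): alpha from (10) is estimated
        from the W replicates and plugged into (8); e_cv_exact is the exactly
        known E(error_cv)."""
        err, err_cv = np.asarray(err), np.asarray(err_cv)
        alpha = np.cov(err, err_cv)[0, 1] / err_cv.var(ddof=1)      # eq. (10)
        est = alpha * e_cv_exact + (err - alpha * err_cv).mean()    # eq. (8)
        var_naive = err.var(ddof=1) / err.size                      # naive MC variance
        var_cv = (err - alpha * err_cv).var(ddof=1) / err.size      # reduced variance
        return est, var_naive, var_cv

    # synthetic replicates for illustration only
    rng = np.random.default_rng(1)
    common = 0.02 * rng.standard_normal(100)
    err    = 0.15 + common + 0.01 * rng.standard_normal(100)
    err_cv = 0.16 + common + 0.01 * rng.standard_normal(100)
    print(cv_estimate(err, err_cv, 0.16))  # var_cv well below var_naive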
1. The first idea is to utilize the error rate computed by LDA as error_cv, based on the assumption of equal covariance matrices for all the groups. The error rate error is calculated by QDA from N random realizations drawn from the group densities. To get Ê_MC(error) we generate W such error rates and average. Therefore we use N × W random vectors. Now we take the same random vectors and apply LDA with the same, so-called pooled, covariance matrix for all groups to calculate error rates error_cv. If Σ_i is the assumed covariance matrix for group i, then (∑_{i=1}^g (N_i − 1) Σ_i)/(N − g) is the pooled covariance matrix, where N_i is the number of realizations in group i. The differences of the W corresponding estimates error and error_cv are used to calculate Ê(error − error_cv). At last, we calculate E(error_cv) in an exact manner (so that we have no variance) by numerical integration based on the densities with pooled covariance matrices. We now have all the ingredients we need for an efficiency comparison with the naive Monte Carlo estimator. The variance of the naive estimator is calculated from the sample of size W of the estimated error rates error, and the variance of the control variate estimator from the sample of size W of (error − error_cv). This approach has the drawback that we have to calculate an exact integral in a projection space which might be two dimensional or of even higher dimension, with rather ugly borderlines.

2. A second possibility to determine the error rate error is to use another control variate: the error rate of an "optimal" one dimensional projection. This can be obtained by the largest eigenvalue and the corresponding eigenvector of QDA in the original space, or by direct minimization of the error rate. We do the same as in 1 to obtain Ê_MC(error). But then we project the random vectors onto the optimally discriminating direction, taking into account the different covariance structures, and build the differences of corresponding error estimates to compute Ê(error − error_cv). Now, the exact calculation of E(error_cv) is simply a one dimensional integration with clear-cut intersection points. This speeds up the procedure compared to 1. To construct the optimally discriminating one dimensional projection we follow an idea in [5], where it was proposed to project on the first eigenvector v1 of MM′, where

    M = (μ_g − μ_1, ..., μ_2 − μ_1, Σ_g − Σ_1, ..., Σ_2 − Σ_1),       (12)

where the μ_i are the group means and the Σ_i are (again) the group covariance matrices, i = 1, ..., g. The projected means, variances and feature vectors then have the form μ_i* = v1′μ_i, σ_i*² = v1′Σ_i v1, and x* = v1′x; a small computational sketch follows.
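A sketch (Python; naming is ours, not the paper's) of how v1 and the projected parameters might be computed, reading (12) so that the mean differences and the columns of the covariance differences enter M column-wise; the column order does not affect MM′:

    import numpy as np

    def projection_direction(mus, covs):
        """First eigenvector v1 of M M' with M as in (12): columns are the
        mean differences mu_i - mu_1 and the columns of Sigma_i - Sigma_1."""
        d = len(mus[0])
        blocks = [(m - mus[0]).reshape(d, 1) for m in mus[1:]]
        blocks += [c - covs[0] for c in covs[1:]]
        M = np.hstack(blocks)
        _, V = np.linalg.eigh(M @ M.T)   # eigenvalues in ascending order
        return V[:, -1]                  # eigenvector of the largest eigenvalue (up to sign)

    def project_params(mus, covs, v1):
        """Projected means mu_i* = v1' mu_i and variances sigma_i*^2 = v1' Sigma_i v1."""
        return [float(v1 @ m) for m in mus], [float(v1 @ c @ v1) for c in covs]

    # the two groups of the first simulation in section 4 as a check:
    mus = [np.zeros(2), np.array([2.0, 0.0])]
    covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
    v1 = projection_direction(mus, covs)
    print(v1, project_params(mus, covs, v1))  # v1 = (1, 0)': projected N(0,1), N(2,1)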
In order to represent adequate control variates, the additional computation time of procedures 1 and 2 has to be small relative to the computation time of naive Monte Carlo. That this is the case, leaving aside the computation of the exact expected error rates, should be clear from the following arguments.

- Naive Monte Carlo estimates the means and the covariance matrices of the groups, and evaluates the corresponding estimated group densities for each observation.

- Procedure 1 additionally needs to compute the mean of the estimated covariance matrices of the groups, and to evaluate, for each observation, the group densities corresponding to the pooled covariance matrix in each group.

- Procedure 2 additionally computes the 'difference matrix' M, its first eigenvector v1, and the corresponding projections of the group means and covariance matrices, and evaluates the corresponding 1D normal densities at each projected observation.

Therefore, in procedures 1 and 2 the 'preparation' of the density evaluation does not depend on the number of observations, resulting in a much smaller additional 'preparation time' than the preparation time for naive Monte Carlo for big numbers of observations. Moreover, in procedure 2 the additional density evaluations are also much quicker than the density evaluations in naive Monte Carlo, since they are in 1D.

In procedure 1 the exact expected error rates have to be calculated numerically, in general. For the exact expected error rates in procedure 2, however, an analytic formula can even be derived. This will be done in the next subsection. In section 4 we will demonstrate the differences between procedures 1 and 2 by some examples.

3.3 Exact expected error rates

In procedure 2, exact expected error rates have to be calculated for univariate normal projected distributions. In this case a general formula for the exact expected error rate can be given, depending on the intersection points of the univariate normal densities corresponding to the projected group means and variances. In order to illustrate the idea, let us discuss the cases of 2 and 3 groups. Moreover, let us assume equal a-priori probabilities 1/g for all g groups, for simplicity. In the simulations in the following sections, we will also discuss these cases only.

In the case of 2 groups, let the intersection point of the two normal densities be x12. Then, obviously, the exact expected error rate corresponding to these densities is (cp. figure 1)

    E(error_cv) = (1 − Φ1(x12) + Φ2(x12))/2,                          (13)

where Φ1 is the normal distribution with mean to the left of x12, and Φ2 the distribution with mean to the right.

[Figure 1: The error rate of the left group is gray shaded.]
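The intersection points in (13) solve φ1(x) = φ2(x), a quadratic in x after taking logs. A minimal sketch (Python; helper names and the root-selection heuristic for unequal variances are ours, not from the paper):

    import numpy as np
    from scipy.stats import norm

    def intersections(mu1, s1, mu2, s2):
        """Solve log phi_1(x) = log phi_2(x); two roots for unequal variances."""
        a = 1.0/(2*s2**2) - 1.0/(2*s1**2)
        b = mu1/s1**2 - mu2/s2**2
        c = mu2**2/(2*s2**2) - mu1**2/(2*s1**2) + np.log(s2/s1)
        if abs(a) < 1e-12:                 # equal variances: a single crossing
            return np.array([-c / b])
        return np.sort(np.roots([a, b, c]))

    def exact_error_2groups(mu1, s1, mu2, s2):
        """Eq. (13); assumes mu1 < mu2. With unequal variances we pick the
        crossing between the two means as the relevant x12 (a heuristic)."""
        r = intersections(mu1, s1, mu2, s2)
        inside = r[(r >= mu1) & (r <= mu2)]
        x12 = inside[0] if inside.size else r[0]
        return (1 - norm.cdf(x12, mu1, s1) + norm.cdf(x12, mu2, s2)) / 2

    # N(0,1) vs N(2,1): x12 = 1, error = Phi(-1), i.e. the 15.87% quoted
    # for procedure 2 in the first simulation of section 4.1
    print(exact_error_2groups(0.0, 1.0, 2.0, 1.0))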
In the case of 3 groups, let the distribution indices again be chosen so that a lower index indicates a lower mean. We are now interested in the relative location of the intersection points of the 3 densities. The error rate of the leftmost group 1 is determined by the first intersection of its density, on the right hand side, with one of the densities of the other groups. For the rightmost group 3 the same is true for the densities on the left. The error rate of the middle group 2 is, correspondingly, determined by the first intersection points of its density with the other densities on the left and on the right. For simplicity, let us now assume that the relevant intersection points are x12, determining the error of group 1 and the 'left part' of the error of group 2, and x23, determining the error of group 3 and the 'right part' of the error of group 2. This leads to the following formula for the exact error rate corresponding to the 3 groups:

    E(error_cv) = ((1 − Φ1(x12)) + (Φ2(x12) + (1 − Φ2(x23))) + Φ3(x23))/3.   (14)

As an example, consider 3 groups with group means μ1 = −3, μ2 = −2, and μ3 = 0, and with standard deviations σ1 = 2.037, σ2 = 0.9406, and σ3 = 1. These parameters lead to intersection points x12 = −3.17 and x23 = −1, as well as an exact error rate E(error_cv) = 31.45%.

For procedure 1, we only succeeded in finding a general analytic formula for the exact expected error rate in the case of 2 groups. Procedure 1 assumes equal covariance matrices for all groups. This covariance matrix is estimated by the pooled covariance matrix Σ̂ = (Σ̂1 + Σ̂2)/2, where Σ̂_i is the estimated covariance matrix of group i, i = 1, 2. For normal group distributions with means μ1 and μ2 and a common covariance matrix Σ, one can show (see [6], p. 12) that the exact error rate is

    E(error_cv) = Φ(−0.5 δ12),   where   δ12 = √((μ2 − μ1)′ Σ⁻¹ (μ2 − μ1))    (15)

and Φ is the distribution function of the standard normal.
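As a quick numerical check of (14) on the example above (Python; the intersection points are taken from the text, where they are rounded):

    from scipy.stats import norm

    mus, sds = [-3.0, -2.0, 0.0], [2.037, 0.9406, 1.0]
    x12, x23 = -3.17, -1.0          # relevant intersection points from the text

    err = ((1 - norm.cdf(x12, mus[0], sds[0]))      # group 1
           + norm.cdf(x12, mus[1], sds[1])          # group 2, 'left part'
           + (1 - norm.cdf(x23, mus[1], sds[1]))    # group 2, 'right part'
           + norm.cdf(x23, mus[2], sds[2])) / 3     # group 3
    print(f"{err:.2%}")  # about 31.4%; 31.45% with the unrounded x12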
4 Simulations

4.1 Known densities

In this subsection we assume that the group densities are fully known, so that parameter estimation is superfluous. This means in particular that QDA as well as LDA is carried out with the correct densities. In this way the outcome does not depend on the goodness of the parameter estimation. In the next subsection, we will discuss the case when density parameters have to be estimated.

In all examples, sample size N = 100 is used for each group. Also, W = 100 is used. In order to be independent of the drawn random vectors, every experiment was repeated V = 100 times, and the means of the mean error rates and the corresponding standard deviations, as well as the correlation coefficients, will be reported in what follows.

First Simulation:

First we compare procedures 1 and 2 using two groups with the following parameters of normal distributions:

    μ1 = (0, 0)′   and   Σ1 = | 1  0 |
                              | 0  1 |                                (16)

as well as

    μ2 = (2, 0)′   and   Σ2 = | 1    0.5 |
                              | 0.5  1   |.                           (17)

The true expected error, calculated by exact integration to be able to judge the following results, is approximately 14.97%.

By means of the naive Monte Carlo estimator we obtain

    Ê_MC(error) = (15.00 ± 2.53)%,                                    (18)

where 2.53% is the estimated standard deviation of Ê_MC(error). Obviously, the bias is negligible.

With procedure 2 one obtains

    Ê(error − error_cv) = (−0.92 ± 1.65)%                             (19)

and E(error_cv) equals 15.87% (exact integration). Summing up for the right hand side of equation (5), we arrive at (14.95 ± 1.65)%. This expression shows a distinctly lower variance than (18). The mean estimated correlation coefficient is ρ = 0.79. The lowest standard deviation we can get by (8) is therefore 1.55%. This corresponds to a variance reduction of more than 60% in relation to the naive Monte Carlo.

Moreover, with procedure 1 one obtains

    Ê(error − error_cv) = (−0.02 ± 0.61)%                             (20)

and E(error_cv) equals 15.08% (exact integration). Summing up for the right hand side of equation (5), we arrive at (15.06 ± 0.61)%. This expression shows an even much lower variance than with procedure 2, which indicates that LDA is a very adequate method for this example. Indeed, the mean estimated correlation coefficient is ρ = 0.97.
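The following sketch (Python; not the authors' code) reproduces the setup of the first simulation with procedure 2 as the control variate; for (16)-(17) the first eigenvector of MM′ is the first coordinate axis, so the projected groups are N(0,1) and N(2,1) with exact E(error_cv) = Φ(−1) ≈ 15.87%:

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    rng = np.random.default_rng(2)
    mus  = [np.zeros(2), np.array([2.0, 0.0])]
    covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
    N, W = 100, 100

    v1 = np.array([1.0, 0.0])                         # optimal direction for (16)-(17)
    pm = [float(v1 @ m) for m in mus]                 # projected means: 0, 2
    ps = [float(np.sqrt(v1 @ c @ v1)) for c in covs]  # projected sds: 1, 1
    e_cv_exact = norm.cdf(-1.0)                       # exact E(error_cv) = 15.87%

    rvs = [multivariate_normal(m, c) for m, c in zip(mus, covs)]
    err, err_cv = np.empty(W), np.empty(W)
    for w in range(W):
        e_qda = e_proj = 0
        for i, rv in enumerate(rvs):
            x = rv.rvs(size=N, random_state=rng)
            dens = np.column_stack([r.pdf(x) for r in rvs])
            e_qda += np.sum(np.argmax(dens, axis=1) != i)  # rule (1), true densities
            z = x @ v1                                     # projected realizations
            wrong = (norm.pdf(z, pm[1], ps[1]) > norm.pdf(z, pm[0], ps[0])) != (i == 1)
            e_proj += np.sum(wrong)                        # 1D rule on the projection
        err[w], err_cv[w] = e_qda / (2 * N), e_proj / (2 * N)

    print(err.mean(), err.std(ddof=1))                 # naive MC, cf. (18)
    print(e_cv_exact + (err - err_cv).mean(),          # control variate, cf. (5)
          (err - err_cv).std(ddof=1))                  # cf. (19)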
Second Simulation:

Now we compare procedures 1 and 2 by an example with three different groups which do not separate as nicely as in the previous simulation. In addition to the two groups of the first simulation we use a third group with the following parameters of a normal distribution:

    μ3 = (3, 0)′   and   Σ3 = | 2    0.3 |
                              | 0.3  2.4 |.                           (21)

The true expected error rate is approximately 28.44%.

The results of naive Monte Carlo and the two control variate procedures are summarized in Table 1.

    Sim   naiveMC   proc1   proc2   var1   var2   cor1   cor2   min1   min2   mvar1   mvar2
     1      2.53     0.61    1.65    94%    57%   0.97   0.79   0.61   1.55    94%     62%
     2      2.55     2.31    1.96    18%    41%   0.59   0.70   2.06   1.82    35%     49%
     3      2.55     1.93    2.45    43%     8%   0.74   0.43   1.72   2.30    55%      9%

    Table 1: Monte Carlo standard deviations, variance reductions, and correlations for known densities.

Note that 'Sim' indicates the simulation number, 'naiveMC' the estimated mean standard deviation of the naive Monte Carlo, 'proc1' and 'proc2' the corresponding standard deviations of the control variate procedures, 'var1' and 'var2' the corresponding percentages of variance reduction, 'cor1' and 'cor2' the mean correlation coefficients, 'min1' and 'min2' the corresponding minimal standard deviations of the control variate procedures, and 'mvar1' and 'mvar2' the corresponding maximum percentages of variance reduction.

Analysing Table 1, note particularly that for simulation 2 procedure 2 leads to a bigger variance reduction than procedure 1, but that the maximum variance reduction is nevertheless smaller than for simulation 1, since the univariate projected constellation of the groups is more complicated in this example. The bad performance of procedure 1 indicates that in this example the covariance matrices of the different groups cannot be assumed to be approximately equal.

Third Simulation:

Up to now, the examples were mainly one dimensional in that the groups were shifted in the first component only. Since this might lead to an overoptimistic judgement of procedure 2, the third example is the same as the second, but with

    μ3 = (1, 1)′,                                                     (22)

i.e. with a mean shifted in both directions.

The corresponding Monte Carlo results can also be found in Table 1. Note in particular that now procedure 1 is again very adequate, but procedure 2, unfortunately, does not lead to a substantial variance reduction, and might thus even cause an increase in computer time. The problems of procedure 2 also become clear noting that the exact expected error rate is 45% for this procedure, in contrast to a true expected error rate of around 33%.
As a preliminary conclusion, this indicates that procedure 2 is probably useful only if a 1-dimensional projection does not alter the problem too much. Naturally, a similar statement is true for procedure 1, but the assumption of equal, or at least similar, covariance structures is, most of the time, not that problematic. That the amount of variance reduction depends on the 'similarity' of the control variate to the true error rate could already have been deduced from equation (11). On the other hand, for procedure 1 the exact expected error rate is not easily computable, in general, whereas procedure 2 is much quicker.

4.2 Estimated densities

Since in practice the distributions of the grouped observations are not known, the implementation of the control variate procedures has to be adapted somewhat. For the purpose of this paper we nevertheless assume normal distributions for convenience, so that only the corresponding distribution parameters have to be estimated.

Moreover, since the densities have to be estimated from the observations, the true expected misclassification rate has to be estimated by means of a resampling method in order to avoid overoptimism. As the resampling method we use leave-one-out cross-validation, i.e. we eliminate one observation, estimate the densities from the remaining observations, and predict the class of the eliminated observation by means of these estimated densities. This is done for each observation.

This causes two problems. First, the extra loop for resampling leads to such a big computational effort that the number of replicates of the whole Monte Carlo experiment is reduced to V = 10 for simulation 1 and to V = 5 otherwise. Second, the exact expectations in procedures 1 and 2 should not have to be computed for each resampled sample. Thus, we propose to compute the exact expectation from the "observed" sample only, and to use this value for all resampled samples as well. Moreover, for the purpose of the simulations for this paper, we decided to use the exact expectations from the densities used to generate the observations in order to reduce the computational effort. Finally, we used the same example densities as in the preceding section to be able to judge the extra variance caused by parameter estimation.

The results of the Monte Carlo simulations can be found in Table 2. Note the increase of variance and the very small correlation ρ with procedure 2 in the third simulation. The optimal standard deviation reachable by (8) in this case would thus be 2.53, which is only very slightly lower than the 2.57 of the naive Monte Carlo. Since, nevertheless, the results are very similar to the results of the simulations with known densities, the conclusions from the last subsection appear to be valid also in the case of density parameters that have to be estimated.
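A sketch of the leave-one-out loop with QDA parameter estimation (Python; the names are ours, and the control variate bookkeeping as well as the W and V replication loops are omitted):

    import numpy as np
    from scipy.stats import multivariate_normal

    def loo_error(X, y, groups):
        """Leave-one-out cross-validated error rate: refit the group means and
        covariance matrices without the held-out observation, then allocate it
        by rule (1); equal a-priori probabilities are assumed."""
        n, errors = len(y), 0
        for k in range(n):
            keep = np.arange(n) != k
            dens = []
            for g in groups:
                Xg = X[keep & (y == g)]
                mu = Xg.mean(axis=0)
                S = np.cov(Xg, rowvar=False)   # per-group covariance (QDA)
                dens.append(multivariate_normal(mu, S).pdf(X[k]))
            errors += int(groups[int(np.argmax(dens))] != y[k])
        return errors / n

    # data drawn from the first simulation's densities, as an illustration
    rng = np.random.default_rng(3)
    X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), size=100),
                   rng.multivariate_normal([2, 0], [[1, 0.5], [0.5, 1]], size=100)])
    y = np.repeat([0, 1], 100)
    print(loo_error(X, y, [0, 1]))  # cross-validated QDA error rate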