TEST (2016) 25:197–227
DOI 10.1007/s11749-016-0481-7

INVITED PAPER

A random forest guided tour

Gérard Biau (1,2) · Erwan Scornet (1)

Published online: 19 April 2016
© Sociedad de Estadística e Investigación Operativa 2016

Abstract The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.

Keywords Random forests · Randomization · Resampling · Parameter tuning · Variable importance

This invited paper is discussed in comments available at: doi:10.1007/s11749-016-0482-6; doi:10.1007/s11749-016-0483-5; doi:10.1007/s11749-016-0484-4; doi:10.1007/s11749-016-0485-3; doi:10.1007/s11749-016-0487-1.

Gérard Biau, gerard.biau@upmc.fr
Erwan Scornet, erwan.scornet@upmc.fr

1 Sorbonne Universités, UPMC Univ Paris 06, CNRS, Laboratoire de Statistique Théorique et Appliquées (LSTA), boîte 158, 4 place Jussieu, 75005 Paris, France
2 Institut universitaire de France, Paris, France
Mathematics Subject Classification 62G02

1 Introduction

To take advantage of the sheer size of modern data sets, we now need learning algorithms that scale with the volume of information, while maintaining sufficient statistical efficiency. Random forests, devised by Breiman in the early 2000s (Breiman 2001), are part of the list of the most successful methods currently available to handle data in these cases. This supervised learning procedure, influenced by the early work of Amit and Geman (1997), Ho (1998), and Dietterich (2000), operates according to the simple but effective "divide and conquer" principle: sample fractions of the data, grow a randomized tree predictor on each small piece, then paste (aggregate) these predictors together.

What has greatly contributed to the popularity of forests is the fact that they can be applied to a wide range of prediction problems and have few parameters to tune. Aside from being simple to use, the method is generally recognized for its accuracy and its ability to deal with small sample sizes and high-dimensional feature spaces. At the same time, it is easily parallelizable and has, therefore, the potential to deal with large real-life systems. The corresponding R package randomForest can be freely downloaded on the CRAN web site (http://www.r-project.org), while a MapReduce (Jeffrey and Sanja 2008) open source implementation called Partial Decision Forests is available on the Apache Mahout website at https://mahout.apache.org. This allows the building of forests using large data sets as long as each partition can be loaded into memory.

The random forest methodology has been successfully involved in various practical problems, including a data science hackathon on air quality prediction (http://www.kaggle.com/c/dsg-hackathon), chemoinformatics (Svetnik et al. 2003), ecology (Prasad et al. 2006; Cutler et al. 2007), 3D object recognition (Shotton et al. 2011), and bioinformatics (Díaz-Uriarte and de Andrés 2006), just to name a few. Howard (Kaggle) and Bowles (Biomatica) claim in Howard and Bowles (2012) that ensembles of decision trees—often known as "random forests"—have been the most successful general-purpose algorithm in modern times, while Varian, Chief Economist at Google, advocates in Varian (2014) the use of random forests in econometrics.

On the theoretical side, the story of random forests is less conclusive and, despite their extensive use, little is known about the mathematical properties of the method. The most celebrated theoretical result is that of Breiman (2001), which offers an upper bound on the generalization error of forests in terms of correlation and strength of the individual trees. This was followed by a technical note (Breiman 2004), which focuses on a stylized version of the original algorithm (see also Breiman 2000a, b). A critical step was subsequently taken by Lin and Jeon (2006), who highlighted an interesting connection between random forests and a particular class of nearest neighbor predictors, further developed by Biau and Devroye (2010). In recent years, various theoretical studies have been performed (e.g., Meinshausen 2006; Biau et al. 2008; Ishwaran and Kogalur 2010; Biau 2012; Genuer 2012; Zhu et al. 2015), analyzing more elaborate
models and moving ever closer to the practical situation. Recent attempts towards narrowing the gap between theory and practice include that of Denil et al. (2013), who prove the consistency of a particular online forest, Wager (2014) and Mentch and Hooker (2015), who study the asymptotic distribution of forests, and Scornet et al. (2015), who show that Breiman's (2001) forests are consistent in an additive regression framework.

The difficulty in properly analyzing random forests can be explained by the black-box flavor of the method, which is indeed a subtle combination of different components. Among the forests' essential ingredients, both bagging (Breiman 1996) and the Classification And Regression Trees (CART)-split criterion (Breiman et al. 1984) play critical roles. Bagging (a contraction of bootstrap-aggregating) is a general aggregation scheme, which generates bootstrap samples from the original data set, constructs a predictor from each sample, and decides by averaging (a short code sketch at the end of this introduction illustrates the scheme). It is one of the most effective computationally intensive procedures to improve on unstable estimates, especially for large, high-dimensional data sets, where finding a good model in one step is impossible because of the complexity and scale of the problem (Bühlmann and Yu 2002; Kleiner et al. 2014; Wager et al. 2014). As for the CART-split criterion, it originates from the influential CART program of Breiman et al. (1984), and is used in the construction of the individual trees to choose the best cuts perpendicular to the axes. At each node of each tree, the best cut is selected by optimizing the CART-split criterion, based on the so-called Gini impurity (for classification) or the prediction squared error (for regression).

However, while bagging and the CART-splitting scheme play key roles in the random forest mechanism, both are difficult to analyze with rigorous mathematics, thereby explaining why theoretical studies have so far considered simplified versions of the original procedure. This is often done by simply ignoring the bagging step and/or replacing the CART-split selection by a more elementary cut protocol. As well as this, in Breiman's (2001) forests, each leaf (that is, a terminal node) of individual trees contains a small number of observations, typically between 1 and 5.

The goal of this survey is to embark the reader on a guided tour of random forests. We focus on the theory behind the algorithm, trying to give an overview of major theoretical approaches while discussing their inherent pros and cons. For a more methodological review covering applied aspects of random forests, we refer to the surveys by Criminisi et al. (2011) and Boulesteix et al. (2012). We start by gently introducing the mathematical context in Sect. 2 and describe in full detail Breiman's (2001) original algorithm. Section 3 focuses on the theory for a simplified forest model called purely random forests, and emphasizes the connections between forests, nearest neighbor estimates and kernel methods. Section 4 provides some elements of theory about resampling mechanisms, the splitting criterion and the mathematical forces at work in Breiman's approach. Section 5 is devoted to the theoretical aspects of connected variable selection procedures. Section 6 discusses various extensions to random forests including online learning, survival analysis and clustering problems. A short discussion follows in Sect. 7.
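Before moving on, the bagging ingredient just described can be made concrete with a minimal Python sketch. The sketch is not taken from the paper; the base learner `fit_base`, the data arrays, and the number of resamples are placeholder assumptions introduced only for illustration.

```python
import numpy as np

def bagging_predict(X, y, x_query, fit_base, M=100, seed=0):
    """Minimal bagging sketch: average M base predictors, each fit on a
    bootstrap sample (drawn with replacement) of the training set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)        # bootstrap sample of size n
        predictor = fit_base(X[idx], y[idx])    # fit_base returns a callable predictor
        preds.append(predictor(x_query))        # evaluate it at the query point
    return float(np.mean(preds))                # decide by averaging
```

Any unstable base learner can be plugged in through `fit_base`; random forests correspond to the particular choice of a randomized CART-like tree, described in Sect. 2.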
2 The random forest estimate

2.1 Basic principles

Let us start with a word of caution. The term "random forests" is a bit ambiguous. For some authors, it is but a generic expression for aggregating random decision trees, no matter how the trees are obtained. For others, it refers to Breiman's (2001) original algorithm. We essentially adopt the second point of view in the present survey.

As mentioned above, the forest mechanism is versatile enough to deal with both supervised classification and regression tasks. However, to keep things simple, we focus in this introduction on regression analysis, and only briefly survey the classification case. Our objective in this section is to provide a concise but mathematically precise presentation of the algorithm for building a random forest. The general framework is nonparametric regression estimation, in which an input random vector X ∈ 𝒳 ⊂ R^p is observed, and the goal is to predict the square integrable random response Y ∈ R by estimating the regression function m(x) = E[Y | X = x]. With this aim in mind, we assume we are given a training sample D_n = ((X_1, Y_1), ..., (X_n, Y_n)) of independent random variables distributed as the independent prototype pair (X, Y). The goal is to use the data set D_n to construct an estimate m_n : 𝒳 → R of the function m. In this respect, we say that the regression function estimate m_n is (mean squared error) consistent if E[m_n(X) − m(X)]^2 → 0 as n → ∞ (the expectation is evaluated over X and the sample D_n).

A random forest is a predictor consisting of a collection of M randomized regression trees. For the jth tree in the family, the predicted value at the query point x is denoted by m_n(x; Θ_j, D_n), where Θ_1, ..., Θ_M are independent random variables, distributed the same as a generic random variable Θ and independent of D_n. In practice, the variable Θ is used to resample the training set prior to the growing of individual trees and to select the successive directions for splitting—more precise definitions will be given later. In mathematical terms, the jth tree estimate takes the form

m_n(x; \Theta_j, \mathcal{D}_n) = \sum_{i \in \mathcal{D}^\star_n(\Theta_j)} \frac{\mathbf{1}_{X_i \in A_n(x; \Theta_j, \mathcal{D}_n)}\, Y_i}{N_n(x; \Theta_j, \mathcal{D}_n)},

where D*_n(Θ_j) is the set of data points selected prior to the tree construction, A_n(x; Θ_j, D_n) is the cell containing x, and N_n(x; Θ_j, D_n) is the number of (preselected) points that fall into A_n(x; Θ_j, D_n).

At this stage, we note that the trees are combined to form the (finite) forest estimate

m_{M,n}(x; \Theta_1, \ldots, \Theta_M, \mathcal{D}_n) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, \mathcal{D}_n).    (1)
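To illustrate the two displays above, the following Python sketch (not from the paper; the representation of cells as axis-aligned boxes and the variable names are assumptions made for the example) computes a single tree estimate as a local average over the cell containing x, and the finite forest estimate (1) as the mean of the M tree estimates.

```python
import numpy as np

def tree_estimate(x, cells, X_sel, y_sel):
    """Tree estimate: average of the Y_i, over the preselected points (X_sel, y_sel),
    whose X_i fall in the cell A_n(x) containing the query point x.
    `cells` is a list of (lower, upper) box bounds forming a partition of the space."""
    for lower, upper in cells:
        if np.all(x >= lower) and np.all(x < upper):          # A_n(x): the cell containing x
            in_cell = np.all((X_sel >= lower) & (X_sel < upper), axis=1)
            N = in_cell.sum()                                  # N_n(x; Theta_j, D_n)
            return y_sel[in_cell].mean() if N > 0 else 0.0     # 0.0 if the cell is empty (sketch convention)
    return 0.0

def forest_estimate(x, trees):
    """Finite forest estimate (1): average of the M tree estimates at x, where
    `trees` is a list of (cells, X_sel, y_sel) triples, one triple per Theta_j."""
    return float(np.mean([tree_estimate(x, cells, Xs, ys) for cells, Xs, ys in trees]))
```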
In the R package randomForest, the default value of M (the number of trees in the forest) is ntree = 500. Since M may be chosen arbitrarily large (limited only by available computing resources), it makes sense, from a modeling point of view, to let M tend to infinity, and consider instead of (1) the (infinite) forest estimate

m_{\infty,n}(x; \mathcal{D}_n) = \mathbb{E}_{\Theta}\big[m_n(x; \Theta, \mathcal{D}_n)\big].

In this definition, E_Θ denotes the expectation with respect to the random parameter Θ, conditional on D_n. In fact, the operation "M → ∞" is justified by the law of large numbers, which asserts that almost surely, conditional on D_n,

\lim_{M \to \infty} m_{M,n}(x; \Theta_1, \ldots, \Theta_M, \mathcal{D}_n) = m_{\infty,n}(x; \mathcal{D}_n)

(see for instance Breiman 2001, and Scornet 2015a, for more information on this limit calculation). In the following, to lighten notation we will simply write m_∞,n(x) instead of m_∞,n(x; D_n).

2.2 Algorithm

We now provide some insight into how the individual trees are constructed and how randomness kicks in. In Breiman's (2001) original forests, each node of a single tree is associated with a hyperrectangular cell. The root of the tree is 𝒳 itself and, at each step of the tree construction, a node (or equivalently its corresponding cell) is split into two parts. The terminal nodes (or leaves), taken together, form a partition of 𝒳.

The algorithm works by growing M different (randomized) trees as follows. Prior to the construction of each tree, a_n observations are drawn at random with (or without) replacement from the original data set. These—and only these—a_n observations (with possible repetitions) are taken into account in the tree building. Then, at each cell of each tree, a split is performed by maximizing the CART-criterion (see below) over mtry directions chosen uniformly at random among the p original ones. (The resulting subset of selected coordinates is called M_try.) Lastly, construction of individual trees is stopped when each cell contains less than nodesize points. For any query point x ∈ 𝒳, each regression tree predicts the average of the Y_i (that were among the a_n points) for which the corresponding X_i falls into the cell of x. We draw attention to the fact that growing the tree and making the final estimation only depends on the a_n preselected data points. Algorithm 1 describes in full detail how to compute a forest's prediction.

Algorithm 1 may seem a bit complicated at first sight, but the underlying ideas are simple. We start by noticing that this algorithm has three important parameters:

1. a_n ∈ {1, ..., n}: the number of sampled data points in each tree;
2. mtry ∈ {1, ..., p}: the number of possible directions for splitting at each node of each tree;
3. nodesize ∈ {1, ..., a_n}: the number of examples in each cell below which the cell is not split.

By default, in the regression mode of the R package randomForest, the parameter mtry is set to ⌈p/3⌉ (⌈·⌉ is the ceiling function), a_n is set to n, and nodesize is set to 5. The role and influence of these three parameters on the accuracy of the method will be thoroughly discussed in the next section.
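For readers who wish to experiment with these three parameters, the sketch below shows one possible translation of the quantities above into scikit-learn's RandomForestRegressor. Scikit-learn is not discussed in the paper and its options do not coincide exactly with Breiman's R implementation, so the mapping is only indicative.

```python
from sklearn.ensemble import RandomForestRegressor  # assumption: scikit-learn analogue, not the R package

p = 10                                    # hypothetical number of input variables
forest = RandomForestRegressor(
    n_estimators=500,                     # M: number of trees (ntree = 500 by default in R)
    max_features=max(1, -(-p // 3)),      # mtry ~ ceil(p/3) candidate directions per split
    min_samples_split=5,                  # rough analogue of nodesize = 5 (smaller cells are not split)
    bootstrap=True,                       # a_n = n points drawn with replacement (bootstrap mode)
    random_state=0,
)
# forest.fit(X_train, y_train) and forest.predict(x_query) then play the role of Algorithm 1.
```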
Algorithm 1: Breiman's random forest predicted value at x.

Input: Training set D_n, number of trees M > 0, a_n ∈ {1, ..., n}, mtry ∈ {1, ..., p}, nodesize ∈ {1, ..., a_n}, and x ∈ 𝒳.
Output: Prediction of the random forest at x.
1  for j = 1, ..., M do
2      Select a_n points, with (or without) replacement, uniformly in D_n. In the following steps, only these a_n observations are used.
3      Set P = (𝒳), the list containing the cell associated with the root of the tree.
4      Set P_final = ∅, an empty list.
5      while P ≠ ∅ do
6          Let A be the first element of P.
7          if A contains less than nodesize points or if all X_i ∈ A are equal then
8              Remove the cell A from the list P.
9              P_final ← Concatenate(P_final, A).
10         else
11             Select uniformly, without replacement, a subset M_try ⊂ {1, ..., p} of cardinality mtry.
12             Select the best split in A by optimizing the CART-split criterion along the coordinates in M_try (see text for details).
13             Cut the cell A according to the best split. Call A_L and A_R the two resulting cells.
14             Remove the cell A from the list P.
15             P ← Concatenate(P, A_L, A_R).
16         end
17     end
18     Compute the predicted value m_n(x; Θ_j, D_n) at x equal to the average of the Y_i falling in the cell of x in partition P_final.
19 end
20 Compute the random forest estimate m_{M,n}(x; Θ_1, ..., Θ_M, D_n) at the query point x according to (1).

We still have to describe how the CART-split criterion operates. For now, to ease understanding, we consider a tree with no subsampling, which uses the entire and original data set D_n for its construction. In addition, we let A be a generic cell and denote by N_n(A) the number of data points falling in A. A cut in A is a pair (j, z), where j is some value (dimension) from {1, ..., p} and z the position of the cut along the jth coordinate, within the limits of A. Let C_A be the set of all such possible cuts in A. Then, with the notation X_i = (X_i^{(1)}, ..., X_i^{(p)}), for any (j, z) ∈ C_A, the CART-split criterion takes the form

L_{\mathrm{reg},n}(j,z) = \frac{1}{N_n(A)} \sum_{i=1}^{n} (Y_i - \bar{Y}_A)^2 \mathbf{1}_{X_i \in A}
  - \frac{1}{N_n(A)} \sum_{i=1}^{n} \big(Y_i - \bar{Y}_{A_L} \mathbf{1}_{X_i^{(j)} < z} - \bar{Y}_{A_R} \mathbf{1}_{X_i^{(j)} \ge z}\big)^2 \mathbf{1}_{X_i \in A},    (2)

where A_L = {x ∈ A : x^{(j)} < z}, A_R = {x ∈ A : x^{(j)} ≥ z}, and Ȳ_A (resp., Ȳ_{A_L}, Ȳ_{A_R}) is the average of the Y_i such that X_i belongs to A (resp., A_L, A_R), with the convention that the average is equal to 0 when no point X_i belongs to A (resp., A_L, A_R). For each cell A, the best cut (j_n^*, z_n^*) is selected by maximizing L_{reg,n}(j, z) over M_try and C_A; that is,

(j_n^*, z_n^*) \in \arg\max_{j \in \mathcal{M}_{\mathrm{try}},\ (j,z) \in \mathcal{C}_A} L_{\mathrm{reg},n}(j, z).

(To remove some of the ties in the argmax, the best cut is always performed in the middle of two consecutive data points.)
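The CART-split criterion (2) and the search for the best cut can be sketched in a few lines of Python. The sketch is not from the paper; the data layout and the exhaustive scan over midpoints are simplifying assumptions made for the example.

```python
import numpy as np

def cart_reg_criterion(y_A, xj_A, z):
    """Criterion (2) for a cut at position z along one coordinate: the sum of squared
    deviations of the responses in the cell minus the within-child sums of squares
    after the cut, normalized by the number of points in the cell."""
    def sse(v):                                   # sum of squared deviations (0 for an empty side)
        return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0
    left, right = y_A[xj_A < z], y_A[xj_A >= z]
    return (sse(y_A) - sse(left) - sse(right)) / len(y_A)

def best_split(X_A, y_A, mtry_coords):
    """Best cut (j*, z*): maximize the criterion over the coordinates in M_try,
    scanning candidate cuts placed at midpoints of consecutive data points."""
    best_j, best_z, best_val = None, None, -np.inf
    for j in mtry_coords:
        xs = np.unique(X_A[:, j])
        for z in (xs[:-1] + xs[1:]) / 2.0:        # midpoints remove ties in the argmax
            val = cart_reg_criterion(y_A, X_A[:, j], z)
            if val > best_val:
                best_j, best_z, best_val = j, z, val
    return best_j, best_z
```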
Let us finally notice that the above optimization program extends effortlessly to the resampling case, by optimizing over the a_n preselected observations instead of the original data set D_n.

Thus, at each cell of each tree, the algorithm chooses uniformly at random mtry coordinates in {1, ..., p}, evaluates criterion (2) over all possible cuts along the directions in M_try, and returns the best one. The quality measure (2) is the criterion used in the CART algorithm of Breiman et al. (1984). This criterion computes the (renormalized) difference between the empirical variance in the node before and after a cut is performed. There are three essential differences between CART and a tree of Breiman's (2001) forest. First of all, in Breiman's forests, the criterion (2) is evaluated over a subset M_try of randomly selected coordinates, and not over the whole range {1, ..., p}. Besides, the individual trees are not pruned, and the final cells do not contain more than nodesize observations (unless all data points in the cell have the same X_i). Lastly, each tree is constructed on a subset of a_n examples picked within the initial sample, not on the whole sample D_n; and only these a_n observations are used to calculate the estimation. When a_n = n (and the resampling is done with replacement), the algorithm runs in bootstrap mode, whereas a_n < n corresponds to subsampling (with or without replacement).

2.3 Supervised classification

For binary classification, the random response Y takes values in {0, 1} and, given X, one has to guess the value of Y. The benchmark is the optimal (but unknown) Bayes classifier

m^\star(x) = \begin{cases} 1 & \text{if } \mathbb{P}[Y = 1 \mid X = x] > \mathbb{P}[Y = 0 \mid X = x] \\ 0 & \text{otherwise.} \end{cases}

In the classification context, the random forest classifier is obtained via a majority vote among the classification trees, that is,

m_{M,n}(x; \Theta_1, \ldots, \Theta_M, \mathcal{D}_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, \mathcal{D}_n) > 1/2 \\ 0 & \text{otherwise.} \end{cases}
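A direct transcription of this majority vote, as a hedged Python sketch (the representation of the individual classification trees as callables returning 0 or 1 is an assumption of the example):

```python
import numpy as np

def forest_classifier(x, tree_classifiers):
    """Random forest classifier: predict 1 when the average of the individual
    tree votes at x exceeds 1/2, and 0 otherwise (ties go to class 0)."""
    votes = np.array([tree(x) for tree in tree_classifiers])   # each tree votes in {0, 1}
    return int(votes.mean() > 0.5)
```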
If a leaf represents region A, then a randomized tree classifier takes the simple form

m_n(x; \Theta_j, \mathcal{D}_n) = \begin{cases} 1 & \text{if } \sum_{i \in \mathcal{D}^\star_n(\Theta_j)} \mathbf{1}_{X_i \in A,\, Y_i = 1} > \sum_{i \in \mathcal{D}^\star_n(\Theta_j)} \mathbf{1}_{X_i \in A,\, Y_i = 0}, \quad x \in A \\ 0 & \text{otherwise,} \end{cases}

where D*_n(Θ_j) contains the data points selected in the resampling step, that is, in each leaf, a majority vote is taken over all (X_i, Y_i) for which X_i is in the same region. Ties are broken, by convention, in favor of class 0. Algorithm 1 can be easily adapted to do two-class classification without modifying the CART-split criterion. To see this, take Y ∈ {0, 1} and consider a single tree with no subsampling step. For any generic cell A, let p_{0,n}(A) (resp., p_{1,n}(A)) be the empirical probability, given a data point in a cell A, that it has label 0 (resp., label 1). By noticing that Ȳ_A = p_{1,n}(A) = 1 − p_{0,n}(A), the classification CART-split criterion reads, for any (j, z) ∈ C_A,

L_{\mathrm{class},n}(j,z) = p_{0,n}(A)\, p_{1,n}(A) - \frac{N_n(A_L)}{N_n(A)} \, p_{0,n}(A_L)\, p_{1,n}(A_L) - \frac{N_n(A_R)}{N_n(A)} \, p_{0,n}(A_R)\, p_{1,n}(A_R).

This criterion is based on the so-called Gini impurity measure 2p_{0,n}(A)p_{1,n}(A) (Breiman et al. 1984), which has the following simple interpretation. To classify a data point that falls in cell A, one uses the rule that assigns a point, uniformly selected from {X_i ∈ A : (X_i, Y_i) ∈ D_n}, to label ℓ with probability p_{ℓ,n}(A), for ℓ ∈ {0, 1}. The estimated probability that the item actually has label ℓ is p_{ℓ,n}(A). Therefore, the estimated error under this rule is the Gini index 2p_{0,n}(A)p_{1,n}(A). Note, however, that the prediction strategy is different in classification and regression: in the classification regime, each tree uses a local majority vote, whereas in regression the prediction is achieved by a local averaging.

When dealing with classification problems, it is usually recommended to set nodesize = 1 and mtry = √p (see, e.g., Liaw and Wiener 2002).

We draw attention to the fact that regression estimation may also have an interest in the context of dichotomous and multicategory outcome variables (in this case, it is often termed probability estimation). For example, estimating outcome probabilities for individuals is important in many areas of medicine, with applications to surgery, oncology, internal medicine, pathology, pediatrics, and human genetics. We refer the interested reader to Malley et al. (2012) and to the survey papers by Kruppa et al. (2014a, b).
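The classification CART-split criterion above can likewise be sketched in Python (not from the paper; labels are assumed to be coded in {0, 1} and empty children contribute zero impurity):

```python
import numpy as np

def cart_class_criterion(y_A, xj_A, z):
    """Classification CART-split criterion: impurity p0*p1 of the cell minus the
    weighted impurities of the two children produced by a cut at z (the Gini
    impurity measure 2*p0*p1 of the text differs only by a constant factor)."""
    def impurity(v):
        if len(v) == 0:
            return 0.0
        p1 = v.mean()                     # empirical probability of label 1 in the cell
        return p1 * (1.0 - p1)
    n_A = len(y_A)
    left, right = y_A[xj_A < z], y_A[xj_A >= z]
    return (impurity(y_A)
            - (len(left) / n_A) * impurity(left)
            - (len(right) / n_A) * impurity(right))
```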
2.4 Parameter tuning

Literature focusing on tuning the parameters M, mtry, nodesize and a_n is unfortunately rare, with the notable exceptions of Díaz-Uriarte and de Andrés (2006), Bernard et al. (2008), and Genuer et al. (2010). According to Schwarz et al. (2010), tuning the forest parameters may result in a computational burden, in particular for big data sets, with hundreds and thousands of samples and variables. To circumvent this issue, Schwarz et al. (2010) implement a fast version of the original algorithm, which they name Random Jungle.

It is easy to see that the forest's variance decreases as M grows. Thus, more accurate predictions are likely to be obtained by choosing a large number of trees. Interestingly, picking a large M does not lead to overfitting. In effect, following an argument of Breiman (2001), we have

\lim_{M \to \infty} \mathbb{E}\big[m_{M,n}(X; \Theta_1, \ldots, \Theta_M) - m(X)\big]^2 = \mathbb{E}\big[m_{\infty,n}(X) - m(X)\big]^2.

However, the computational cost for inducing a forest increases linearly with M, so a good choice results from a trade-off between computational complexity (M should not be too large for the computations to finish in a reasonable time) and accuracy (M must be large enough for predictions to be stable). In this respect, Díaz-Uriarte and de Andrés (2006) argue that the value of M is irrelevant (provided that M is large enough) in a prediction problem involving microarray data sets, where the aim is to classify patients according to their genetic profiles (typically, less than one hundred patients for several thousand genes). For more details we refer the reader to Genuer et al. (2010), who offer a thorough discussion on the choice of this parameter in various regression problems. Another interesting and related approach is by Latinne et al. (2001), who propose a simple procedure that determines a priori a minimum number of tree estimates to combine to obtain a prediction accuracy level similar to that of a larger forest. Their experimental results show that it is possible to significantly limit the number of trees.

In the R package randomForest, the default value of the parameter nodesize is 1 for classification and 5 for regression. These values are often reported to be good choices (e.g., Díaz-Uriarte and de Andrés 2006), despite the fact that this is not supported by solid theory. A simple algorithm to tune the parameter nodesize in the classification setting is discussed in Kruppa et al. (2013).

The effect of mtry is thoroughly investigated in Díaz-Uriarte and de Andrés (2006), who show that this parameter has little impact on the performance of the method, though larger values may be associated with a reduction in the predictive performance. On the other hand, Genuer et al. (2010) claim that the default value of mtry is either optimal or too small. Therefore, a conservative approach is to take mtry as large as possible (limited by available computing resources) and set mtry = p (recall that p is the dimension of the X_i). A data-driven choice of mtry is implemented in the algorithm Forest-RK of Bernard et al. (2008).

Let us finally notice that even if there is no theoretical guarantee to support the default values of the parameters, they are nevertheless easy to tune without requiring an independent validation set. Indeed, the procedure's accuracy is estimated internally, during the run, as follows. Since each tree is constructed using a different bootstrap sample from the original data, about one-third of the observations are left out of the bootstrap sample and not used in the construction of the jth tree. In this way, for each tree, a test set—disjoint from the training set—is obtained, and averaging over all these left-out data points and over all trees is known as the out-of-bag error estimate. Thus, the out-of-bag error, computed on the observations set aside by the resampling prior to the tree building, offers a simple way to adjust the parameters without the need of a validation set (e.g., Kruppa et al. 2013).
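The out-of-bag mechanism just described can be sketched as follows in Python (not from the paper; `fit_tree` is a placeholder for any routine that fits a tree on a bootstrap sample and returns a predictor, and the squared error is used because we are in the regression setting):

```python
import numpy as np

def oob_error(X, y, fit_tree, M=500, seed=0):
    """Out-of-bag error sketch: for each observation, average the predictions of
    the trees whose bootstrap sample did not contain it, then compare with y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_sum = np.zeros(n)                 # accumulated out-of-bag predictions
    oob_cnt = np.zeros(n)                 # number of trees for which observation i was out-of-bag
    for _ in range(M):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)         # about one-third of the observations are left out
        predictor = fit_tree(X[idx], y[idx])
        oob_sum[oob] += predictor(X[oob])
        oob_cnt[oob] += 1
    seen = oob_cnt > 0
    return float(np.mean((oob_sum[seen] / oob_cnt[seen] - y[seen]) ** 2))
```

Because this estimate comes for free during training, it can be monitored while varying M, mtry or nodesize, which is the parameter-adjustment strategy alluded to above.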
3 Simplified models and local averaging estimates

3.1 Simplified models

Despite their widespread use, a gap remains between the theoretical understanding of random forests and their practical performance. This algorithm, which relies on complex data-dependent mechanisms, is difficult to analyze and its basic mathematical properties are still not well understood.

As observed by Denil et al. (2014), this state of affairs has led to polarization between theoretical and empirical contributions to the literature. Empirically focused papers describe elaborate extensions to the basic random forest framework but come with no clear guarantees. In contrast, most theoretical papers focus on simplifications or stylized versions of the standard algorithm, where the mathematical analysis is more tractable.

A basic framework to assess the theoretical properties of forests involves models in which partitions do not depend on the training set D_n. This family of simplified models is often called purely random forests, for which 𝒳 = [0, 1]^d. A widespread example is the centered forest, whose principle is as follows: (i) there is no resampling step; (ii) at each node of each individual tree, a coordinate is uniformly chosen in {1, ..., p}; and (iii) a split is performed at the center of the cell along the selected coordinate. The operations (ii)–(iii) are recursively repeated k times, where k ∈ N is a parameter of the algorithm. The procedure stops when a full binary tree with k levels is reached, so that each tree ends up with exactly 2^k leaves. The final estimation at the query point x is achieved by averaging the Y_i corresponding to the X_i in the cell of x. The parameter k acts as a smoothing parameter that controls the size of the terminal cells (see Fig. 1 for an example in two dimensions). It should be chosen large enough to detect local changes in the distribution, but not too large, so as to guarantee an effective averaging process in the leaves. In uniform random forests, a variant of centered forests, cuts are performed uniformly at random over the range of the selected coordinate, not at the center. Modulo some minor modifications, their analysis is similar.

The centered forest rule was first formally analyzed by Breiman (2004), and then later by Biau et al. (2008) and Scornet (2015a), who proved that the method is consistent (both for classification and regression) provided k → ∞ and n/2^k → ∞. The proof relies on a general consistency result for random trees stated in Devroye et al. (1996, Chapter 6). If X is uniformly distributed in 𝒳 = [0, 1]^p, then there are on average about n/2^k data points per terminal node. In particular, the choice k ≈ log n corresponds to obtaining a small number of examples in the leaves, in accordance with Breiman's (2001) idea that the individual trees should not be pruned. Unfortunately, this choice of k does not satisfy the condition n/2^k → ∞, so something is lost in the analysis. Moreover, the bagging step is absent, and forest consistency is obtained as a by-product of tree consistency. Overall, this model does not demonstrate the benefit of