models and moving ever closer to the practical situation. Recent attempts towards narrowing the gap between theory and practice include that of Denil et al. (2013), who prove the consistency of a particular online forest, Wager (2014) and Mentch and Hooker (2015), who study the asymptotic distribution of forests, and Scornet et al. (2015), who show that Breiman’s (2001) forests are consistent in an additive regression framework.

The difficulty in properly analyzing random forests can be explained by the black-box flavor of the method, which is indeed a subtle combination of different components. Among the forests’ essential ingredients, both bagging (Breiman 1996) and the Classification And Regression Trees (CART)-split criterion (Breiman et al. 1984) play critical roles. Bagging (a contraction of bootstrap-aggregating) is a general aggregation scheme, which generates bootstrap samples from the original data set, constructs a predictor from each sample, and decides by averaging. It is one of the most effective computationally intensive procedures to improve on unstable estimates, especially for large, high-dimensional data sets, where finding a good model in one step is impossible because of the complexity and scale of the problem (Bühlmann and Yu 2002; Kleiner et al. 2014; Wager et al. 2014). As for the CART-split criterion, it originates from the influential CART program of Breiman et al. (1984), and is used in the construction of the individual trees to choose the best cuts perpendicular to the axes. At each node of each tree, the best cut is selected by optimizing the CART-split criterion, based on the so-called Gini impurity (for classification) or the prediction squared error (for regression).
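To make these two ingredients concrete, here is a minimal, self-contained Python sketch; it is not Breiman’s (2001) algorithm (which grows much deeper trees and also randomizes the candidate split directions at each node), and the function names (best_cart_split, fit_stump, bagged_stumps) are ours for illustration only. Each base learner is a one-split regression tree whose cut minimizes the squared error of the two resulting cells, i.e., the regression form of the CART-split criterion; bagging then fits one such tree per bootstrap sample and averages the predictions. For classification, the Gini impurity would replace the squared-error score.

```python
import numpy as np

def best_cart_split(X, y):
    """Exhaustive search for the axis-aligned cut that minimizes the total
    squared error of the two resulting cells (regression CART-split criterion)."""
    n, d = X.shape
    best = None  # (error, feature index, threshold)
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # no valid cut between tied values
            left, right = ys[:i], ys[i:]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, (xs[i - 1] + xs[i]) / 2)
    return best  # None if no cut is possible

def fit_stump(X, y):
    """One-split regression tree: a cut plus the mean response in each cell."""
    split = best_cart_split(X, y)
    if split is None:
        m = y.mean()
        return lambda x: np.full(len(x), m)
    _, j, t = split
    left_mean, right_mean = y[X[:, j] <= t].mean(), y[X[:, j] > t].mean()
    return lambda x: np.where(x[:, j] <= t, left_mean, right_mean)

def bagged_stumps(X, y, n_estimators=50, rng=np.random.default_rng(0)):
    """Bagging: fit one stump per bootstrap sample and average the predictions."""
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, drawn with replacement
        stumps.append(fit_stump(X[idx], y[idx]))
    return lambda x: np.mean([s(x) for s in stumps], axis=0)

# Toy usage: recover a step function corrupted by noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.normal(size=200)
predict = bagged_stumps(X, y)
print(predict(np.array([[0.25, 0.5], [0.75, 0.5]])))  # roughly [0., 1.]
```

Averaging over bootstrap samples is what stabilizes the otherwise high-variance single-tree (here, single-stump) predictions, which is the sense in which bagging improves on unstable estimates.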
However, while bagging and the CART-splitting scheme play key roles in the random forest mechanism, both are difficult to analyze with rigorous mathematics, thereby explaining why theoretical studies have so far considered simplified versions of the original procedure. This is often done by simply ignoring the bagging step and/or replacing the CART-split selection by a more elementary cut protocol. In addition, in Breiman’s (2001) forests, each leaf (that is, a terminal node) of individual trees contains a small number of observations, typically between 1 and 5.

The goal of this survey is to embark the reader on a guided tour of random forests. We focus on the theory behind the algorithm, trying to give an overview of major theoretical approaches while discussing their inherent pros and cons. For a more methodological review covering applied aspects of random forests, we refer to the surveys by Criminisi et al. (2011) and Boulesteix et al. (2012). We start by gently introducing the mathematical context in Sect. 2 and describe in full detail Breiman’s (2001) original algorithm. Section 3 focuses on the theory for a simplified forest model called purely random forests, and emphasizes the connections between forests, nearest neighbor estimates and kernel methods. Section 4 provides some elements of theory about resampling mechanisms, the splitting criterion and the mathematical forces at work in Breiman’s approach. Section 5 is devoted to the theoretical aspects of connected variable selection procedures. Section 6 discusses various extensions to random forests including online learning, survival analysis and clustering problems. A short discussion follows in Sect. 7.