Data-Snooping Biases in Tests of Financial Asset Pricing Models. Andrew W. Lo; A. Craig MacKinlay. The Review of Financial Studies, Volume 3, Issue 3 (1990), 431-467. Stable URL: http://links.jstor.org/sici?sici=0893-9454%281990%293%3A3%3C431%3ADBITOF3E2.0.CO%3B2-9 The Review of Financial Studies is published by Oxford University Press.
Data-Snooping Biases in Tests of Financial Asset Pricing Models

Andrew W. Lo
Sloan School of Management, Massachusetts Institute of Technology

A. Craig MacKinlay
Wharton School, University of Pennsylvania

Tests of financial asset pricing models may yield misleading inferences when properties of the data are used to construct the test statistics. In particular, such tests are often based on returns to portfolios of common stock, where portfolios are constructed by sorting on some empirically motivated characteristic of the securities such as market value of equity. Analytical calculations, Monte Carlo simulations, and two empirical examples show that the effects of this type of data snooping can be substantial.

Research support from the Batterymarch Fellowship (Lo), the Geewax-Terker Research Fund (MacKinlay), the John M. Olin Fellowship at the National Bureau of Economic Research (Lo), and the National Science Foundation (SES-8821583) is gratefully acknowledged. We thank David Aldous, Cliff Ball, Michael Brennan, Herbert David, Mike Gibbons, Jay Shanken, a referee, and seminar participants at the Board of Governors of the Federal Reserve, Boston College, Columbia, Dartmouth, Harvard, M.I.T., Northwestern, Princeton, Stanford, University of Chicago, University of Michigan, University of Wisconsin at Madison, Washington University, and Wharton for useful comments and suggestions. Address reprint requests to Andrew Lo, Sloan School of Management, M.I.T., 50 Memorial Drive, Cambridge, MA 02139.

The Review of Financial Studies 1990 Volume 3, number 3, pp. 431-467. © 1990 The Review of Financial Studies 0893-9454/90/$1.50

The reliance of economic science upon nonexperimental inference is, at once, one of the most challenging and most nettlesome aspects of the discipline. Because of the virtual impossibility of controlled experimentation in economics, the importance of statistical data analysis is now well-established. However, there is a growing concern that the procedures under which formal statistical inference has been developed may not correspond to those followed in practice. For example, the classical statistical approach to selecting a method of estimation generally involves minimizing an expected loss function, irrespective of the actual data. Yet in practice the properties of the realized data almost always influence the choice of estimator.

Of course, ignoring obvious features of the data can lead to nonsensical inferences even when the estimation procedures are optimal in some metric. But the way we incorporate those features into our estimation and testing procedures can affect subsequent inferences considerably. Indeed, by the very nature of empirical innovation in economics, the axioms of classical statistical analysis are violated routinely: future research is often motivated by the successes and failures of past investigations. Consequently, few empirical studies are free of the kind of data-instigated pretest biases discussed in Leamer (1978). Moreover, we can expect the degree of such biases to increase with the number of published studies performed on any single data set: the more scrutiny a collection of data is subjected to, the more likely will interesting (spurious) patterns emerge. Since stock market prices are perhaps the most studied economic quantities to date, tests of financial asset pricing models seem especially susceptible.
1 Perhaps the most complete analysis of such issues in economic applications is by Leamer (1978). Recent papers by Lakonishok and Smidt (1988), Merton (1987), and Ross (1987) address data snooping in financial economics. Of course, data snooping has been a concern among probabilists and statisticians for quite some time, and is at least as old as the controversy between Bayesian and classical statisticians. Interested readers should consult Berger and Wolpert (1984, chapter 4.2) and Leamer (1978, chapter 9) for further discussion.

In this paper, we attempt to quantify the inferential biases associated with one particular method of testing financial asset pricing models such as the capital asset pricing model (CAPM) and the arbitrage pricing theory (APT). Because there are often many more securities than there are time series observations of stock returns, asset pricing tests are generally performed on the returns of portfolios of securities. Besides reducing the cross-sectional dimension of the joint distribution of returns, grouping into portfolios has also been advanced as a method of reducing the impact of measurement error. However, the selection of securities to be included in a given portfolio is almost never at random, but is often based on some of the stocks' empirical characteristics. The formation of size-sorted portfolios, portfolios based on the market value of the companies' equity, is but one example. Conducting classical statistical tests on portfolios formed this way creates potentially significant biases in the test statistics. These are examples of "data-snooping statistics," a term used by Aldous (1989, p. 252) to describe the situation "where you have a family of test statistics $T(a)$ whose null distribution is known for fixed $a$, but where you use the test statistic $T = T(a)$ for some $a$ chosen using the data." In our application the quantity $a$ may be viewed as a vector of zeros and ones that indicates which securities are to be included in or omitted from a given portfolio. If the choice of $a$ is based on the data, then the sampling distribution of the resulting test statistic is generally not the same as the null distribution with a fixed $a$; hence, the actual size of the test may differ substantially from its nominal value under the null. Under plausible assumptions our calculations show that this kind of data snooping can lead to rejections of the null hypothesis with probability 1 even when the null hypothesis is true!

Although the term "data snooping" may have an unsavory connotation, our usage neither implies nor suggests any sort of intentional misrepresentation or dishonesty. That prior empirical research may influence the way current investigations are conducted is often unavoidable, and this very fact results in what we have called data snooping. Moreover, it is not at all apparent that this phenomenon necessarily imparts a "bias" in the sense that it affects inferences in an undesirable way. After all, the primary reason for publishing scientific discoveries is to add to a store of common knowledge on which future research may build.
But when scientific discovery is statistical in nature, we must weigh the significance of newly discovered relations in view of past inferences. This is recognized implicitly in many formal statistical circumstances, as in the theory of sequential hypothesis testing. But it is considerably more difficult to correct for the effects of specification searches in practice since such searches often consist of sequences of empirical studies undertaken by many individuals over many years.2 For example, as a consequence of the many investigations relating the behavior of stock returns to size, Chen, Roll, and Ross (1986, p. 394) write: "It has been facetiously noted that size may be the best theory we now have of expected returns. Unfortunately, this is less of a theory than an empirical observation." Then, as Merton (1987, p. 107) asks in a related context: "Is it reasonable to use the standard t-statistic as a valid measure of significance when the test is conducted on the same data used by many earlier studies whose results influenced the choice of theory to be tested?" We rephrase this question in the following way: Are standard tests of significance valid when the construction of the test statistics is influenced by empirical relations derived from the very same data to be used in the test? Our results show that using prior information only marginally correlated with statistics of interest can distort inferences dramatically.

2 Statisticians have considered a closely related problem, known as the "file drawer problem," in which the overall significance of several published studies must be assessed while accounting for the possibility of unreported insignificant studies languishing in various investigators' file drawers. An excellent review of the file drawer problem and its remedies, which has come to be known as "meta-analysis," is provided by Iyengar and Greenhouse (1988).

In Section 1, we quantify the data-snooping biases associated with testing financial asset pricing models with portfolios formed by sorting on some empirically motivated characteristic. Using the theory of induced order statistics, we derive in closed form the asymptotic distribution of a commonly used test statistic before and after sorting. This not only yields a measure of the effect of data snooping, but also provides the appropriate sampling theory when snooping is unavoidable. In Section 2, we report the results of Monte Carlo experiments designed to gauge the accuracy of the asymptotic approximations used in Section 1. In Section 3, two empirical examples are provided that illustrate the potential importance of data-snooping biases in existing tests of asset pricing models, and, in Section 4, we show how these biases can arise naturally from our tendency to focus on the unusual. We conclude in Section 5.

1. Quantifying Data-Snooping Biases With Induced Order Statistics

Many tests of the CAPM and APT have been conducted on returns of groups of securities rather than on individual security returns, where the grouping is often according to some empirical characteristic of the securities. Perhaps the most common attribute by which securities are grouped is market value of equity or "size." The prevalence of size-sorted portfolios in recent tests of asset pricing models has not been precipitated by any economic theory linking size to asset prices.
It is a consequence of a series of empirical studies demonstrating the statistical relation between size and the stochastic behavior of stock returns.3 Therefore, we must allow for our foreknowledge of size-related phenomena in evaluating the actual significance of tests performed on size-sorted portfolios. More generally, grouping securities by some characteristic that is empirically motivated may affect the size of the usual significance tests, particularly when the empirical motivation is derived from the very data set on which the test is based.

3 See Banz (1978, 1981), Brown, Kleidon, and Marsh (1983), and Chan, Chen, and Hsieh (1985), for example. Although Banz's (1978) original investigation may have been motivated by theoretical considerations, virtually all subsequent empirical studies exploiting the size effect do so because of Banz's empirical findings, and not his theory.

4 Unfortunately the use of "size" to mean both market value of equity and type I error is unavoidable. Readers beware.
We quantify these effects in the following sections by appealing to asymptotic results for induced order statistics, and show that even mild forms of data snooping can change inferences substantially. In Section 1.1, a brief summary of the asymptotic properties of induced order statistics is provided. In Section 1.2, results for tests based on individual securities are presented, and in Section 1.3, corresponding results for portfolios are reported. We provide a more positive interpretation of data-snooping biases as power against deviations from the null hypothesis in Section 1.4.

1.1 Asymptotic properties of induced order statistics

Since the particular form of data snooping we are investigating is most common in empirical tests of financial asset pricing models, our exposition will lie in that context. Suppose for each of $N$ securities we have some consistent estimator $\hat{\alpha}_i$ of a parameter $\alpha_i$, which is to be used in the construction of an aggregate test statistic. For example, in the Sharpe-Lintner CAPM, $\hat{\alpha}_i$ would be the estimated intercept from the following regression:

$$R_{it} - R_{ft} = \alpha_i + (R_{mt} - R_{ft})\beta_i + \epsilon_{it}, \tag{1}$$

where $R_{it}$, $R_{mt}$, and $R_{ft}$ are the period-$t$ returns on security $i$, the market portfolio, and a risk-free asset, respectively. A test of the null hypothesis that $\alpha_i = 0$ would then be a proper test of the Sharpe-Lintner version of the CAPM; thus, $\hat{\alpha}_i$ may serve as a test statistic itself. However, more powerful tests may be obtained by combining the $\hat{\alpha}_i$'s for many securities. But how should we combine them?

Suppose for each security $i$ we observe some characteristic $X_i$, such as its out-of-sample market value of equity or average annual earnings, and we learn that $X_i$ is correlated empirically with $\hat{\alpha}_i$. By this we mean that the relation between $X_i$ and $\hat{\alpha}_i$ is an empirical fact uncovered by "searching" through the data, and not motivated by any a priori theoretical considerations. This search need not be a systematic sifting of the data, but may be interpreted as any one of Leamer's (1978) six specification searches, which even the most meticulous of classical statisticians has conducted at some point. The key feature is that our interest in characteristic $X_i$ is derived from a look at the data, the same data to be used in performing our test. Common intuition suggests that using information contained in the $X_i$'s can yield a more powerful test of economic restrictions on the $\alpha_i$'s. But if this characteristic is not a part of the original null hypothesis, and only catches our attention after a look at the data (or after a look at another's look at the data), using it to form our test statistics may lead us to reject those economic restrictions even when they obtain. More formally,
if we write $\hat{\alpha}_i$ as

$$\hat{\alpha}_i = \alpha_i + \zeta_i, \tag{2}$$

then it is evident that under the null hypothesis where $\alpha_i = 0$, any correlation between $X_i$ and $\hat{\alpha}_i$ must be due to correlation between the characteristic and estimation or measurement error $\zeta_i$. Although measurement error is usually assumed to be independent of all other relevant economic variables, the very process by which the characteristic comes to our attention may induce spurious correlation between $X_i$ and $\zeta_i$. We formalize this intuition in Section 4 and proceed now to show that such spurious correlation has important implications for testing the null hypothesis.

This is most evident in the extreme case where the null hypothesis $\alpha_i = 0$ is tested by performing a standard t-test on the largest of the $\hat{\alpha}_i$'s. Clearly such a test is biased toward rejection unless we account for the fact that the largest $\hat{\alpha}_i$ has been drawn from the set $\{\hat{\alpha}_j\}$. Otherwise, extreme realizations of estimation error will be confused with a violation of the null hypothesis. If, instead of choosing $\hat{\alpha}_i$ by its value relative to other $\hat{\alpha}_j$'s, our choice is based on some characteristic $X_i$ correlated with the estimation errors of $\hat{\alpha}_i$, a similar bias might arise, albeit to a lesser degree.
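The extreme case just described can be illustrated with a small Monte Carlo experiment. The sketch below is not taken from the paper: it assumes the estimates and the characteristic are standard normal with correlation `rho`, replaces the t-test with a two-sided z-test at known unit variance, and uses hypothetical names (`rejection_rates`, `n_securities`, `rho`). It compares the rejection rate of a nominal 5 percent test under the null for a security fixed in advance, for the security with the largest estimate, and for the security with the largest characteristic.

```python
import numpy as np

CRIT_5PCT = 1.959964  # two-sided 5% critical value of a standard normal


def rejection_rates(n_securities=100, rho=0.5, n_trials=20_000, seed=0):
    """Monte Carlo rejection rates of a nominal 5% z-test when alpha_i = 0 for all i."""
    rng = np.random.default_rng(seed)
    # Assumed setup: X_i and alpha_hat_i are standard normal with correlation rho.
    x = rng.standard_normal((n_trials, n_securities))
    noise = rng.standard_normal((n_trials, n_securities))
    alpha_hat = rho * x + np.sqrt(1.0 - rho**2) * noise

    rows = np.arange(n_trials)
    picks = {
        "fixed": alpha_hat[:, 0],                    # security chosen before seeing data
        "largest": alpha_hat.max(axis=1),            # security with the largest estimate
        "by_x": alpha_hat[rows, x.argmax(axis=1)],   # security with the largest X_i
    }
    # Each estimate has unit variance, so the estimate itself is the z-statistic.
    return {name: float(np.mean(np.abs(stat) > CRIT_5PCT))
            for name, stat in picks.items()}


if __name__ == "__main__":
    for name, rate in rejection_rates().items():
        print(f"{name:8s} rejection rate: {rate:.3f}")
```

Under these assumptions the fixed security should reject at close to the nominal rate, the largest estimate should reject far more often, and selection on the correlated characteristic should fall between the two, which is the ordering the text describes.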
To formalize the preceding intuition, suppose that only a subset of $n$ securities is used to form the test statistic and these $n$ are chosen by sorting the $X_i$'s. That is, let us reorder the bivariate vectors $[X_i \;\; \hat{\alpha}_i]'$ according to their first components, yielding the sequence

$$\begin{bmatrix} X_{1:N} \\ \hat{\alpha}_{[1:N]} \end{bmatrix}, \; \begin{bmatrix} X_{2:N} \\ \hat{\alpha}_{[2:N]} \end{bmatrix}, \; \ldots, \; \begin{bmatrix} X_{N:N} \\ \hat{\alpha}_{[N:N]} \end{bmatrix}, \tag{3}$$

where $X_{1:N} < X_{2:N} < \cdots < X_{N:N}$ and the notation $X_{i:N}$ follows that of the statistics literature in denoting the $i$th order statistic from the sample of $N$ observations $\{X_i\}$.5 The notation $\hat{\alpha}_{[i:N]}$ denotes the $i$th induced order statistic corresponding to $X_{i:N}$, or the $i$th concomitant of the order statistic $X_{i:N}$.6 That is, if the bivariate vectors $[X_i \;\; \hat{\alpha}_i]'$ are ordered according to the $X_i$ entries, $\hat{\alpha}_{[i:N]}$ is defined to be the second component of the $i$th ordered vector. The $\hat{\alpha}_{[i:N]}$'s are not themselves ordered but correspond to the ordering of the $X_{i:N}$'s.7 For example, if $X_i$ is firm size and $\hat{\alpha}_i$ is the intercept from a market-model regression of firm $i$'s excess return on the excess market return, then $\hat{\alpha}_{[j:N]}$ is the $\hat{\alpha}$ of the $j$th smallest of the $N$ firms. We call this procedure induced ordering of the $\hat{\alpha}_i$'s.

5 It is implicitly assumed throughout that both $\hat{\alpha}_i$ and $X_i$ have continuous joint and marginal cumulative distribution functions; hence, strict inequalities suffice.

6 The term concomitant of an order statistic was introduced by David (1973), who was perhaps the first to systematically investigate its properties and applications. The term induced order statistic was coined by Bhattacharya (1974) at about the same time. Although the former term seems to be more common usage, we use the latter in the interest of brevity. See Bhattacharya (1984) for an excellent review.

It is apparent that if we construct a test statistic by choosing $n$ securities according to the ordering (3), the sampling theory cannot be the same as that of $n$ securities selected independently of the data. From the following remarkably simple result by Yang (1977), an asymptotic sampling theory for test statistics based on induced order statistics may be derived analytically:8

Theorem 1.1. Let the vectors $[X_i \;\; \hat{\alpha}_i]'$, $i = 1, \ldots, N$, be independently and identically distributed and let $1 < i_1 < i_2 < \cdots < i_n < N$ be sequences of integers such that, as $N \to \infty$, $i_k/N \to \xi_k \in (0, 1)$ $(k = 1, 2, \ldots, n)$. Then

$$\lim_{N \to \infty} \Pr\left(\hat{\alpha}_{[i_1:N]} < a_1, \ldots, \hat{\alpha}_{[i_n:N]} < a_n\right) = \prod_{k=1}^{n} \Pr\left(\hat{\alpha}_k < a_k \mid F_x(X_k) = \xi_k\right), \tag{4}$$

where $F_x(\cdot)$ is the marginal cumulative distribution function of $X_i$.

Proof. See Yang (1977).
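The limit in (4) can be checked numerically. The sketch below is an illustration under assumptions that the theorem itself does not require: the pairs are drawn as bivariate normal with zero means, unit variance for $X_i$, standard deviation `sigma_alpha` for $\hat{\alpha}_i$, and correlation `rho`. In that case the conditional distribution of $\hat{\alpha}$ given $F_x(X) = \xi$ is normal with mean $\rho\sigma_\alpha\Phi^{-1}(\xi)$ and variance $\sigma_\alpha^2(1-\rho^2)$. The function names and parameter values are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def simulate_concomitants(n_securities=2000, rho=0.5, sigma_alpha=1.0,
                          percentiles=(0.27, 0.45), n_trials=4000, seed=0):
    """Draw induced order statistics alpha_hat_[i_k:N] at the given size percentiles."""
    rng = np.random.default_rng(seed)
    # Zero-based positions of the ranks i_k, with i_k / N close to xi_k.
    ranks = [int(round(p * n_securities)) - 1 for p in percentiles]
    draws = np.empty((n_trials, len(ranks)))
    for t in range(n_trials):
        x = rng.standard_normal(n_securities)
        noise = rng.standard_normal(n_securities)
        alpha_hat = sigma_alpha * (rho * x + np.sqrt(1.0 - rho**2) * noise)
        order = np.argsort(x)                 # sort securities by the characteristic
        draws[t] = alpha_hat[order[ranks]]    # concomitants of the chosen order statistics
    return draws


def limiting_moments(rho, sigma_alpha, percentiles):
    """Mean and variance of alpha_hat given F_x(X) = xi under bivariate normality."""
    means = [rho * sigma_alpha * NormalDist().inv_cdf(p) for p in percentiles]
    variance = sigma_alpha**2 * (1.0 - rho**2)
    return means, variance


if __name__ == "__main__":
    draws = simulate_concomitants()
    means, variance = limiting_moments(0.5, 1.0, (0.27, 0.45))
    print("simulated means:    ", draws.mean(axis=0), "limiting:", means)
    print("simulated variances:", draws.var(axis=0), "limiting:", variance)
    print("cross-correlation:  ", np.corrcoef(draws[:, 0], draws[:, 1])[0, 1])
```

With these settings the simulated means and variances should lie close to the limiting values, and the sample correlation between the two concomitants should be near zero, consistent with the product form on the right-hand side of (4). The 27th and 45th percentiles are arbitrary choices; any ranks bounded away from the extremes would serve.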
This result gives the large-sample joint distribution of a finite subset of induced order statistics whose identities are determined solely by their relative rankings $\xi_k$ (as ranked according to the order statistics $X_{i:N}$). From (4) it is evident that the $\hat{\alpha}_{[i_k:N]}$'s are mutually independent in large samples. If $X_i$ were the market value of equity of the $i$th company, Theorem 1.1 shows that the $\hat{\alpha}_i$ of the security with size at, for example, the 27th percentile is asymptotically independent of the $\hat{\alpha}_j$ of the security with size at the 45th percentile.9 If the characteristics $\{X_i\}$ and $\{\hat{\alpha}_i\}$ are statistically independent, the joint distribution of the latter clearly cannot be influenced by ordering according to the former. It is tempting to conclude that as long as the correlation between $X_i$ and $\hat{\alpha}_i$ is economically small, induced ordering cannot greatly affect inferences. Using Yang's result we show the fallacy of this argument in Sections 1.2 and 1.3.

7 If the vectors are independently and identically distributed and $X_i$ is perfectly correlated with $\hat{\alpha}_i$, then the $\hat{\alpha}_{[i:N]}$ are also order statistics. But as long as the correlation coefficient $\rho$ is strictly between $-1$ and $1$, then, for example, $\hat{\alpha}_{[N:N]}$ will generally not be the largest $\hat{\alpha}_i$.

8 See also David and Galambos (1974) and Watterson (1959). In fact, Yang (1977) provides the exact finite-sample distribution of any finite collection of induced order statistics, but even assuming bivariate normality does not yield a tractable form of this distribution.

9 This is a limiting result and implies that the identities of the stocks with 27th and 45th percentile sizes will generally change as $N$ increases.

1.2 Biases of tests based on individual securities

We evaluate the bias of induced ordering under the following assumption:

(A) The vectors $[X_i \;\; \hat{\alpha}_i]'$ $(i = 1, 2, \ldots, N)$ are independently and identically distributed bivariate normal random vectors with mean $[\mu_x \;\; \alpha]'$, variance $[\sigma_x^2 \;\; \sigma_\alpha^2]'$, and correlation $\rho \in (-1, 1)$.

The null hypothesis $H$ is then

$$H: \alpha = 0.$$

Examples of asset pricing models that yield restrictions of this form are the Sharpe-Lintner CAPM and the exact factor pricing version of Ross's APT.10 Under this null hypothesis, the $\hat{\alpha}_i$'s deviate from zero solely through estimation error.

Since the sampling theory provided by Theorem 1.1 is asymptotic, we construct our test statistics using a finite subset of $n$ securities where it is assumed that $n \ll N$. If these securities are selected without the prior use of data, then we have the following well-known result:

$$\theta \equiv \frac{1}{\hat{\sigma}_\alpha^2} \sum_{i=1}^{n} \hat{\alpha}_i^2 \;\overset{a}{\sim}\; \chi_n^2, \tag{5}$$

where $\hat{\sigma}_\alpha^2$ is any consistent estimator of $\sigma_\alpha^2$.11 Therefore, a 5 percent test of $H$ may be performed by checking whether $\theta$ is greater or less than $C_{.05}^{n}$, where $C_{.05}^{n}$ is defined by

$$F_{\chi_n^2}\left(C_{.05}^{n}\right) = .95 \tag{6}$$

and $F_{\chi_n^2}(\cdot)$ is the cumulative distribution function of a $\chi_n^2$ variate.

10 See Chamberlain (1983), Huberman and Kandel (1987), Lehmann and Modest (1988), and Wang (1988) for further discussion of exact factor pricing models. Examples of tests that fit into the framework of $H$ are those in Campbell (1987), Connor and Korajczyk (1988), Gibbons, Ross, and Shanken (1989), Huberman and Kandel (1987), Lehmann and Modest (1988), and MacKinlay (1987).

11 In most contexts the consistency of $\hat{\sigma}_\alpha^2$ is with respect to the number of time series observations $T$. In that case something must be said of the relative rates at which $T$ and $N$ increase without bound so as to guarantee convergence of $\theta$. However, under $H$ the parameter $\sigma_\alpha^2$ may be estimated cross-sectionally; hence, the relation in (5) need only represent $N$-asymptotics.

Now suppose we construct $\theta$ from the induced order statistics $\hat{\alpha}_{[i_k:N]}$, $k = 1, \ldots, n$, instead of the $\hat{\alpha}_i$'s. Specifically, define the following test statistic:

$$\tilde{\theta} \equiv \frac{1}{\hat{\sigma}_\alpha^2} \sum_{k=1}^{n} \hat{\alpha}_{[i_k:N]}^2. \tag{7}$$

Using Theorem 1.1, the following proposition is easily established:

Proposition 1.1. Under the null hypothesis $H$ and assumption (A), as $N$ increases without bound the induced order statistics $\hat{\alpha}_{[i_k:N]}$ $(k = 1, \ldots, n)$ converge in distribution to independent gaussian random variables with mean $\mu_k$ and variance $\sigma_k^2$, where

$$\mu_k \equiv \rho\,(\sigma_\alpha/\sigma_x)\left[F_x^{-1}(\xi_k) - \mu_x\right] = \rho\,\sigma_\alpha\,\Phi^{-1}(\xi_k), \tag{8}$$

$$\sigma_k^2 \equiv \sigma_\alpha^2\,(1 - \rho^2), \tag{9}$$

which implies

$$\tilde{\theta} \;\overset{a}{\sim}\; (1 - \rho^2)\cdot\chi_n^2(\lambda), \tag{10}$$

with noncentrality parameter

$$\lambda \equiv \sum_{k=1}^{n} \left(\frac{\mu_k}{\sigma_k}\right)^2 = \frac{\rho^2}{1 - \rho^2} \sum_{k=1}^{n} \left[\Phi^{-1}(\xi_k)\right]^2, \tag{11}$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function.

Proof. This follows directly from the definition of a noncentral chi-squared variate. The second equality in (8) follows from the fact that $\Phi(\xi_k) = F_x(\xi_k\sigma_x + \mu_x)$. ■

Proposition 1.1 shows that the null hypothesis $H$ is violated by induced ordering since the means of the ordered $\hat{\alpha}_i$'s are no longer zero. Indeed, the mean of $\hat{\alpha}_{[i_k:N]}$ may be positive or negative depending on $\rho$ and the (limiting) relative rank $\xi_k$. For example, if $\rho = .10$ and $\sigma_\alpha = 1$, the mean of the induced order statistic in the 95th percentile is 0.164.

The simplicity of $\tilde{\theta}$'s asymptotic distribution follows from the fact that the $\hat{\alpha}_{[i_k:N]}$'s become independent as $N$ increases without bound. It follows from the fact that induced order statistics are conditionally independent when conditioned on the order statistics that determine the induced ordering. This seemingly counterintuitive result is easy to see when $[X_i \;\; \hat{\alpha}_i]'$ is bivariate normal, since, in this case

$$\hat{\alpha}_i = \alpha + \rho\,(\sigma_\alpha/\sigma_x)\left[X_i - \mu_x\right] + Z_i, \qquad Z_i \ \text{i.i.d.} \ N\!\left(0, \sigma_\alpha^2(1 - \rho^2)\right), \tag{12}$$