Implementing Statistical Criteria to Select Return Forecasting Models

manifestly nonstationary, or, if not, their behavior is close enough to unit-root nonstationary for small-sample statistics to be affected.

We study an international sample of excess stock returns and candidate predictors which First Quadrant was kind enough to release to us. The time period nests that of another international study, Solnik (1993). Therefore, we also provide diagnostic tests that compare the two datasets (which are based on different sources).

We discover ample evidence of predictability, confirming the conclusion of studies that were not based on formal model selection criteria. Usually, however, only a few standard predictors are retained. Some of these are unit-root nonstationary (e.g., dividend yield). Multiple lagged bond or stock returns are at times included, effectively generating the moving-average predictors that have become popular in professional circles lately [see also Brock, Lakonishok, and LeBaron (1992) and Sullivan, Timmermann, and White (1997)].

Formal model selection criteria guard against overfitting. The ultimate purpose is to obtain the model with the best external validity. In the context of prediction, this means that the retained model should provide good out-of-sample predictability. We test this on our dataset of international stock returns.

Overall, we find no out-of-sample predictability. More specifically, none of the models that the selection criteria chose generates significant predictive power in the 5-year period beyond the initial ("training") sample. This conclusion is based on an SUR test of the slope coefficients in out-of-sample regressions of outcomes onto predictions across the different stock markets. The failure to detect out-of-sample predictability cannot be attributed to lack of power. Schwarz's Bayesian criterion, for instance, discovers predictability in 9 of 14 markets, with an average R² of the retained models of 6%.
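To illustrate how a criterion such as Schwarz's Bayesian criterion (BIC) trades in-sample fit against parsimony when choosing among linear prediction models, consider the following minimal sketch. The data here are simulated and the function names (`bic`, `select_by_bic`) are hypothetical illustrations, not the paper's actual procedure or dataset:

```python
import numpy as np
from itertools import combinations

def bic(y, X):
    """BIC of an OLS regression of y on X (intercept always included)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / n          # ML estimate of the error variance
    k = X1.shape[1]                     # number of estimated coefficients
    return n * np.log(sigma2) + k * np.log(n)

def select_by_bic(y, candidates):
    """Exhaustive search over all subsets of candidate predictors;
    retain the subset with the lowest BIC (fine for a handful of predictors)."""
    best_score = bic(y, np.empty((len(y), 0)))   # benchmark: constant only
    best_subset = ()
    p = candidates.shape[1]
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            score = bic(y, candidates[:, list(subset)])
            if score < best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset

# Toy example: one predictor truly forecasts the return series, two are noise.
rng = np.random.default_rng(0)
n = 240
preds = rng.standard_normal((n, 3))
y = 0.5 * preds[:, 0] + rng.standard_normal(n)
score, chosen = select_by_bic(y, preds)
print(chosen)   # predictor 0 should be among the retained columns
```

The log(n) penalty per coefficient is what makes the criterion conservative: a candidate predictor is retained only when its fit improvement outweighs the penalty, which is how such criteria typically end up keeping "only a few standard predictors."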
Out of sample, however, none of the retained models generates significant forecasting power. Even with only nine samples of 60 months each, the chance that this would occur if 6% were indeed the true R² is less than 1 in 333.

The poor external validity of the prediction models that formal model selection criteria chose indicates model nonstationarity: the parameters of the "best" prediction model change over time. It is an open question why this is. One potential explanation is that the "correct" prediction model is actually nonlinear, while our selection criteria chose exclusively among linear models. Still, these criteria pick the best linear prediction model: it is surprising that even this best forecaster does not work out of sample.

As an explanation for the findings, however, model nonstationarity lacks economic content. It raises the question of what generates this nonstationarity. Pesaran and Timmermann (1995) also noticed that prediction performance improves if one switches models over time. They suggest that it reflects learning in the marketplace. Bossaerts (1997) investigates this possibility theoretically. He proves that evidence of predictability will disappear