Introduction to Nonparametric Analysis in Time Series Econometrics
Yongmiao Hong
2020
This is Chapter 6 of a manuscript entitled Modern Time Series Analysis: Theory and Applications, written by the author. We introduce some popular nonparametric methods, particularly the kernel smoothing method and the local polynomial smoothing method, to estimate functions of interest in time series contexts, such as probability density functions, autoregression functions, spectral density functions, and generalized spectral density functions. Empirical applications of these functions depend crucially on estimating them consistently. We discuss the large-sample statistical properties of nonparametric estimators in various contexts.

Key words: Asymptotic normality, bias, boundary problem, consistency, curse of dimensionality, density function, generalized spectral density, global smoothing, integrated mean squared error, law of large numbers, local polynomial smoothing, local smoothing, locally stationary time series model, mean squared error, kernel method, regression function, series approximation, smoothing, spectral density function, Taylor series expansion, variance.

Reading Materials and References

This lecture note is self-contained. However, the following references will be useful for learning nonparametric analysis.

(1) Nonparametric Analysis in Time Domain

Silverman, B. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall: London.
Härdle, W. (1990), Applied Nonparametric Regression, Cambridge University Press: Cambridge.
Fan, J. and Q. Yao (2003), Nonlinear Time Series: Nonparametric and Parametric Methods, Springer: New York.

(2) Nonparametric Methods in Frequency Domain

Priestley, M. (1981), Spectral Analysis and Time Series, Academic Press: New York.
Hannan, E. (1970), Multiple Time Series, John Wiley: New York.
1 Motivation

Suppose $\{X_t\}$ is a strictly stationary process with marginal probability density function $g(x)$ and pairwise joint probability density function $f_j(x, y)$, and a random sample $\{X_t\}_{t=1}^{T}$ of size $T$ is observed. Then,

• How to estimate the marginal pdf $g(x)$ of $\{X_t\}$?

• How to estimate the pairwise joint pdf $f_j(x, y)$ of $(X_t, X_{t-j})$?

• How to estimate the autoregression function $r_j(x) = E(X_t \mid X_{t-j} = x)$?

• How to estimate the spectral density $h(\omega)$ of $\{X_t\}$?

• How to estimate the generalized spectral density $f(\omega, u, v)$ of $\{X_t\}$?

• How to estimate the bispectral density $b(\omega_1, \omega_2)$?

• How to estimate a nonlinear autoregressive conditional heteroskedastic model
\[
X_t = \mu(X_{t-1}, \ldots, X_{t-p}) + \sigma(X_{t-1}, \ldots, X_{t-q})\,\varepsilon_t, \qquad \{\varepsilon_t\} \sim \text{i.i.d.}(0, 1),
\]
where $\mu(\cdot)$ and $\sigma(\cdot)$ are unknown functions of the past information? Under certain regularity conditions, $\mu(\cdot)$ is the conditional mean of $X_t$ given $I_{t-1} = \{X_{t-1}, X_{t-2}, \ldots\}$ and $\sigma^2(\cdot)$ is the conditional variance of $X_t$ given $I_{t-1}$.

• How to estimate a semi-nonparametric functional coefficient autoregressive process
\[
X_t = \sum_{j=1}^{p} \alpha_j(X_{t-d})\,X_{t-j} + \varepsilon_t, \qquad E(\varepsilon_t \mid I_{t-1}) = 0 \ \text{a.s.},
\]
where $\alpha_j(\cdot)$ is unknown, and $d > 0$ is a time lag parameter?

• How to estimate a nonparametric additive autoregressive process
\[
X_t = \sum_{j=1}^{p} \mu_j(X_{t-j}) + \varepsilon_t, \qquad E(\varepsilon_t \mid I_{t-1}) = 0 \ \text{a.s.},
\]
where the $\mu_j(\cdot)$ functions are unknown?

• How to estimate a locally linear time-varying regression model
\[
Y_t = X_t'\beta(t/T) + \varepsilon_t,
\]
where $\beta(\cdot)$ is an unknown smooth deterministic function of time?
• How to use these estimators in economic and financial applications?

Nonparametric estimation is often called nonparametric smoothing, since a key parameter, called the smoothing parameter, controls the degree of smoothness of the estimated curve. Nonparametric smoothing first arose from spectral density estimation in time series analysis. In a discussion of the seminal paper by Bartlett (1946), Henry Daniels suggested that spectral density estimation could be improved by smoothing the periodogram (see Chapter 3), which is the squared modulus of the discrete Fourier transform of the random sample $\{X_t\}_{t=1}^{T}$. The theory and techniques were then systematically developed by Bartlett (1948, 1950). Thus, smoothing techniques were already prominent in time series analysis more than 70 years ago.

In the earlier stage of nonlinear time series analysis (see Tong (1990)), the focus was on various nonlinear parametric forms, such as threshold autoregressive models, smooth transition autoregressive models, and Markov regime-switching autoregressive models (see Chapter 8 for details). Recent interest has been mainly in nonparametric curve estimation, which does not require knowledge of the functional form beyond certain smoothness conditions on the underlying function of interest.

Question: Why is nonparametric smoothing popular in statistics and econometrics?

There are several reasons for the popularity of nonparametric analysis. Three main reasons are:

• Demand for nonlinear approaches;

• Availability of large data sets;

• Advances in computer technology.

Indeed, as Granger (1999) points out, computing speed increases much faster than the speed at which data grows.

To obtain the basic ideas behind nonparametric smoothing methods, we now consider two examples: the estimation of a regression function and the estimation of a probability density function.
Example 1 [Regression Function]: Consider the first-order autoregression function
\[
r_1(x) = E(X_t \mid X_{t-1} = x).
\]
We can write
\[
X_t = r_1(X_{t-1}) + \varepsilon_t,
\]
where $E(\varepsilon_t \mid X_{t-1}) = 0$ by construction. We assume $E(X_t^2) < \infty$.

Suppose a sequence of functions $\{\psi_j(x)\}$ constitutes a complete orthonormal basis for the space of square-integrable functions. Then we can always decompose the function
\[
r_1(x) = \sum_{j=0}^{\infty} \alpha_j \psi_j(x),
\]
where the Fourier coefficient
\[
\alpha_j = \int_{-\infty}^{\infty} r_1(x)\,\psi_j(x)\,dx
\]
is the projection of $r_1(x)$ on the base $\psi_j(x)$.

Suppose there is a quadratic function $r_1(x) = x^2$ for $x \in [-\pi, \pi]$. Then
\[
r_1(x) = \frac{\pi^2}{3} - 4\left[\cos(x) - \frac{\cos(2x)}{2^2} + \frac{\cos(3x)}{3^2} - \cdots\right]
= \frac{\pi^2}{3} - 4\sum_{j=1}^{\infty} (-1)^{j-1}\,\frac{\cos(jx)}{j^2}.
\]

For another example, suppose the regression function is a step function, namely
\[
r_1(x) =
\begin{cases}
-1 & \text{if } -\pi < x < 0, \\
\phantom{-}0 & \text{if } x = 0, \\
\phantom{-}1 & \text{if } 0 < x < \pi.
\end{cases}
\]
Then we can still expand it as an infinite sum of periodic series,
\[
r_1(x) = \frac{4}{\pi}\left[\sin(x) + \frac{\sin(3x)}{3} + \frac{\sin(5x)}{5} + \cdots\right]
= \frac{4}{\pi}\sum_{j=0}^{\infty} \frac{\sin[(2j+1)x]}{2j+1}.
\]
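The two expansions above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names, the evaluation grid, and the truncation orders are illustrative choices rather than part of the text. It evaluates partial sums of both series and reports how far they are from $x^2$ and from the step function.

```python
import numpy as np

def quad_partial_sum(x, p):
    """pi^2/3 - 4 * sum_{j=1}^{p} (-1)^(j-1) cos(jx) / j^2."""
    j = np.arange(1, p + 1)
    terms = (-1.0) ** (j - 1) * np.cos(np.outer(x, j)) / j**2
    return np.pi**2 / 3 - 4.0 * terms.sum(axis=1)

def step_partial_sum(x, p):
    """(4/pi) * sum_{j=0}^{p} sin((2j+1)x) / (2j+1)."""
    k = 2 * np.arange(0, p + 1) + 1
    terms = np.sin(np.outer(x, k)) / k
    return 4.0 / np.pi * terms.sum(axis=1)

# Interior grid on (-pi, pi); the endpoints are dropped because the step
# function is only defined on the open interval.
x = np.linspace(-np.pi, np.pi, 2001)[1:-1]

err_quad = {}  # maximum absolute error for r_1(x) = x^2
mse_step = {}  # mean squared error for the step function
for p in (5, 50, 500):
    err_quad[p] = np.max(np.abs(quad_partial_sum(x, p) - x**2))
    mse_step[p] = np.mean((step_partial_sum(x, p) - np.sign(x)) ** 2)
    print(f"p={p:4d}  max|error| (x^2) = {err_quad[p]:.5f}  "
          f"MSE (step) = {mse_step[p]:.5f}")
```

The step function is assessed in mean square rather than by its maximum error because its partial sums overshoot near the jump at $x = 0$ (the Gibbs phenomenon), so the maximum error does not vanish there.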
In general, we do not assume that the functional form of $r_1(x)$ is known, except that we still maintain the assumption that $r_1(x)$ is a square-integrable function. Because $r_1(x)$ is square-integrable, we have
\[
\begin{aligned}
\int_{-\infty}^{\infty} r_1^2(x)\,dx
&= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j \alpha_k \int_{-\infty}^{\infty} \psi_j(x)\psi_k(x)\,dx \\
&= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j \alpha_k \delta_{j,k} \quad \text{by orthonormality} \\
&= \sum_{j=0}^{\infty} \alpha_j^2 < \infty,
\end{aligned}
\]
where $\delta_{j,k}$ is the Kronecker delta function: $\delta_{j,k} = 1$ if $j = k$ and $0$ otherwise.

The square summability implies $\alpha_j \to 0$ as $j \to \infty$; that is, $\alpha_j$ becomes less important as the order $j \to \infty$. This suggests that a truncated sum
\[
r_{1p}(x) = \sum_{j=0}^{p} \alpha_j \psi_j(x)
\]
can be used to approximate $r_1(x)$ arbitrarily well if $p$ is sufficiently large. The approximation error, or the bias,
\[
b_p(x) \equiv r_1(x) - r_{1p}(x) = \sum_{j=p+1}^{\infty} \alpha_j \psi_j(x) \to 0 \quad \text{as } p \to \infty.
\]

However, the coefficient $\alpha_j$ is unknown. To obtain a feasible estimator for $r_1(x)$, we consider the following sequence of truncated regression models
\[
X_t = \sum_{j=0}^{p} \beta_j \psi_j(X_{t-1}) + \varepsilon_{pt},
\]
where $p \equiv p(T) \to \infty$ is the number of series terms, which depends on the sample size $T$. We need $p/T \to 0$ as $T \to \infty$, i.e., the number of series terms $p$ is much smaller than the sample size $T$. Note that the regression error $\varepsilon_{pt}$ is not the same as the true innovation $\varepsilon_t$ for each given $p$. Instead, it contains the true innovation $\varepsilon_t$ and the bias $b_p(X_{t-1})$.
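The last remark can be spelled out in one line. The derivation below is a sketch under the simplifying assumption that the truncated model is evaluated at the Fourier coefficients themselves, i.e. $\beta_j = \alpha_j$ for $j = 0, \ldots, p$; it uses only $X_t = r_1(X_{t-1}) + \varepsilon_t$ and the definitions of $r_{1p}(x)$ and $b_p(x)$ given above.

```latex
\begin{aligned}
\varepsilon_{pt}
  &= X_t - \sum_{j=0}^{p} \alpha_j \psi_j(X_{t-1}) \\
  &= \bigl[ r_1(X_{t-1}) + \varepsilon_t \bigr] - r_{1p}(X_{t-1}) \\
  &= \varepsilon_t + b_p(X_{t-1}).
\end{aligned}
```

Under this assumption the regression error differs from the true innovation by exactly the truncation bias evaluated at $X_{t-1}$, which vanishes as $p \to \infty$.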
The ordinary least squares estimator is
\[
\hat{\beta} = (\Psi'\Psi)^{-1}\Psi' X = \left(\sum_{t=2}^{T} \psi_t \psi_t'\right)^{-1} \sum_{t=2}^{T} \psi_t X_t,
\]
where $\Psi = (\psi_2, \ldots, \psi_T)'$ is a $(T-1) \times (p+1)$ matrix, $X = (X_2, \ldots, X_T)'$, and $\psi_t = [\psi_0(X_{t-1}), \psi_1(X_{t-1}), \ldots, \psi_p(X_{t-1})]'$ is a $(p+1) \times 1$ vector. The series-based regression estimator is
\[
\hat{r}_{1p}(x) = \sum_{j=0}^{p} \hat{\beta}_j \psi_j(x).
\]

To ensure that $\hat{r}_{1p}(x)$ is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$ (e.g., $p = \sqrt{T}$). However, if $p$ is too large, the number of estimated parameters will be too large, and as a consequence, the sampling variation of $\hat{\beta}$ will be large (i.e., the estimator $\hat{\beta}$ is imprecise). We must choose an appropriate $p = p(T)$ so as to balance the bias and the sampling variation. The truncation order $p$ is called a smoothing parameter because it controls the smoothness of the estimated function $\hat{r}_{1p}(x)$. In general, for any given sample, a small $p$ gives a smooth estimated curve, whereas a large $p$ gives a wiggly estimated curve. If $p$ is so large that the variance of $\hat{r}_{1p}(x)$ exceeds its squared bias, we say that there is undersmoothing. In contrast, if $p$ is so small that the variance of $\hat{r}_{1p}(x)$ is smaller than its squared bias, we say that there is oversmoothing. Optimal smoothing is achieved when the variance of $\hat{r}_{1p}(x)$ balances its squared bias. The series estimator $\hat{r}_{1p}(x)$ is called a global smoothing method, because once $p$ is given, the estimated function $\hat{r}_{1p}(x)$ is determined over the entire domain of $X_t$.

Under suitable regularity conditions, $\hat{r}_{1p}(x)$ will consistently estimate the unknown function $r_1(x)$ as the sample size $T$ increases. This is called nonparametric estimation because no parametric functional form is imposed on $r_1(x)$.

The base functions $\{\psi_j(\cdot)\}$ can be the Fourier series (i.e., the sine and cosine functions) or B-spline functions if $X_t$ has a bounded support. See, e.g., Andrews (1991, Econometrica) and Hong and White (1995, Econometrica) for applications.
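The series regression estimator described above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: the data are simulated from $X_t = \sin(X_{t-1}) + 0.5\varepsilon_t$, the basis is a cosine basis on the observed range of $X_{t-1}$ rescaled to $[0, \pi]$, and the names `cosine_design` and `series_fit` are hypothetical. The OLS step is solved with `numpy.linalg.lstsq` rather than by forming $(\Psi'\Psi)^{-1}$ explicitly.

```python
import numpy as np

def cosine_design(x, p, lo, hi):
    """Columns psi_j(x) = cos(j * u), j = 0..p, with u = pi * (x - lo) / (hi - lo)."""
    u = np.pi * (np.asarray(x, dtype=float) - lo) / (hi - lo)
    return np.cos(np.outer(u, np.arange(p + 1)))

def series_fit(x_lag, x_now, p):
    """OLS of X_t on psi_0(X_{t-1}), ..., psi_p(X_{t-1}); returns (beta_hat, r_hat)."""
    lo, hi = x_lag.min(), x_lag.max()
    Psi = cosine_design(x_lag, p, lo, hi)            # (T-1) x (p+1) design matrix
    beta_hat, *_ = np.linalg.lstsq(Psi, x_now, rcond=None)

    def r_hat(x):
        return cosine_design(x, p, lo, hi) @ beta_hat

    return beta_hat, r_hat

# Simulated nonlinear AR(1) with true regression function r_1(x) = sin(x)
# (an assumption made only for this illustration).
rng = np.random.default_rng(0)
T = 2000
X = np.zeros(T)
for t in range(1, T):
    X[t] = np.sin(X[t - 1]) + 0.5 * rng.standard_normal()

x_lag, x_now = X[:-1], X[1:]
# Evaluate away from the boundary, between the 5% and 95% sample quantiles.
grid = np.linspace(np.quantile(x_lag, 0.05), np.quantile(x_lag, 0.95), 200)

rmse = {}
for p in (0, 6, 60):
    beta_hat, r_hat = series_fit(x_lag, x_now, p)
    rmse[p] = np.sqrt(np.mean((r_hat(grid) - np.sin(grid)) ** 2))
    print(f"p={p:3d}  coefficients={beta_hat.size:3d}  RMSE={rmse[p]:.4f}")
```

Comparing the reported errors across `p` gives a rough feel for the bias and variance trade-off; the evaluation grid deliberately excludes the tails, where few observations are available.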
Example 2 [Probability Density Function]: Suppose the PDF $g(x)$ of $X_t$ is a smooth function with unbounded support. We can expand
\[
g(x) = \phi(x) \sum_{j=0}^{\infty} \beta_j H_j(x),
\]
where the function
\[
\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)
\]
is the $N(0, 1)$ density function, and $\{H_j(x)\}$ is the sequence of Hermite polynomials, defined as
\[
(-1)^j \frac{d^j}{dx^j}\Phi(x) = -H_{j-1}(x)\,\phi(x) \quad \text{for } j > 0,
\]
where $\Phi(\cdot)$ is the $N(0, 1)$ CDF. For example,
\[
\begin{aligned}
H_0(x) &= 1, \\
H_1(x) &= x, \\
H_2(x) &= x^2 - 1, \\
H_3(x) &= x(x^2 - 3), \\
H_4(x) &= x^4 - 6x^2 + 3.
\end{aligned}
\]
See, for example, Magnus, Oberhettinger and Soni (1966, Section 5.6) and Abramowitz and Stegun (1972, Ch. 22).

Here, the Fourier coefficient
\[
\beta_j = \int_{-\infty}^{\infty} g(x) H_j(x) \phi(x)\,dx.
\]
Again, $\beta_j \to 0$ as $j \to \infty$ given $\sum_{j=0}^{\infty} \beta_j^2 < \infty$.

The $N(0, 1)$ PDF $\phi(x)$ is the leading term to approximate the unknown density $g(x)$, and the Hermite polynomial series will capture departures from normality (e.g., skewness and heavy tails).

To estimate $g(x)$, we can consider the sequence of truncated probability densities
\[
g_p(x) = C_p^{-1} \phi(x) \sum_{j=0}^{p} \beta_j H_j(x),
\]
where the constant
\[
C_p = \sum_{j=0}^{p} \beta_j \int H_j(x) \phi(x)\,dx
\]
is a normalization factor to ensure that $g_p(x)$ is a PDF for each $p$. The unknown parameters $\{\beta_j\}$ can be estimated from the sample $\{X_t\}_{t=1}^{T}$ via the maximum likelihood estimation (MLE) method. For example, suppose $\{X_t\}$ is an IID sample. Then
\[
\hat{\beta} = \arg\max_{\beta} \sum_{t=1}^{T} \ln \hat{g}_p(X_t).
\]
To ensure that
\[
\hat{g}_p(x) = \hat{C}_p^{-1} \phi(x) \sum_{j=0}^{p} \hat{\beta}_j H_j(x)
\]
is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$. However, $p$ must grow more slowly than the sample size $T$ so that the sampling variation of $\hat{\beta}$ will not be too large.

For the use of Hermite polynomial series expansions, see, e.g., Gallant and Tauchen (1996, Econometric Theory), Aït-Sahalia (2002, Econometrica), and Cui, Hong and Li (2020).

Question: What are the advantages of nonparametric smoothing methods?

They require few assumptions or restrictions on the data generating process. In particular, they do not assume a specific functional form for the function of interest (of course, certain smoothness conditions such as differentiability are required). They can deliver a consistent estimator for the unknown function, whether it is linear or nonlinear. Thus, nonparametric methods can effectively reduce potential systematic biases due to model misspecification, which is more likely to be encountered in parametric modeling.

Question: What are the disadvantages of nonparametric methods?

• Nonparametric methods require a large data set for reasonable estimation. Furthermore, there exists a notorious "curse of dimensionality" problem when the function of interest contains multiple explanatory variables. This will be explained below.

• There exists another notorious "boundary effect" problem for nonparametric estimation near the boundary regions of the support. This occurs due to asymmetric coverage of data in the boundary regions.
• Coefficients are usually difficult to interpret from an economic point of view.

• There exists a danger of potential overfitting, in the sense that a nonparametric method, due to its flexibility, tends to capture non-essential features of the data that will not appear in out-of-sample scenarios.

The above two motivating examples are the so-called orthogonal series expansion methods. There are other nonparametric methods, such as spline smoothing, kernel smoothing, k-nearest neighbor, and local polynomial smoothing. As mentioned earlier, series expansion methods are examples of so-called global smoothing, because the coefficients are estimated using all observations, and they are then used to evaluate the values of the underlying function over all points in the support of $X_t$. A nonparametric series model is an increasing sequence of parametric models, as the sample size $T$ grows. In this sense, it is also called a sieve estimator. In contrast, kernel and local polynomial methods are examples of the so-called local smoothing methods, because estimation only requires the observations in a neighborhood of the point of interest. Below we will mainly focus on kernel and local polynomial smoothing methods, due to their simplicity and intuitive nature.

2 Kernel Density Method

2.1 Univariate Density Estimation

Suppose $\{X_t\}$ is a strictly stationary time series process with unknown marginal PDF $g(x)$.

Question: How to estimate the marginal PDF $g(x)$ of the time series process $\{X_t\}$?

We first consider a parametric approach. Assume that $g(x)$ is an $N(\mu, \sigma^2)$ PDF with unknown $\mu$ and $\sigma^2$. Then we know the functional form of $g(x)$ up to two unknown parameters $\theta = (\mu, \sigma^2)'$:
\[
g(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right], \qquad -\infty < x < \infty.
\]
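As a minimal sketch of this parametric approach, the snippet below plugs the sample mean and the sample variance into $g(x; \theta)$. The simulated Gaussian AR(1) data, the seed, and the function name `normal_pdf` are illustrative assumptions and do not come from the text; for an IID normal sample these plug-in values coincide with the maximum likelihood estimates of $\mu$ and $\sigma^2$.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(mu, sigma2) density g(x; theta) with theta = (mu, sigma2)'."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Illustrative data (an assumption): Gaussian AR(1), X_t = 0.5 X_{t-1} + e_t,
# whose stationary marginal distribution is N(0, 1 / (1 - 0.25)).
rng = np.random.default_rng(0)
T = 5000
X = np.zeros(T)
for t in range(1, T):
    X[t] = 0.5 * X[t - 1] + rng.standard_normal()

mu_hat = X.mean()
sigma2_hat = X.var()  # divides by T, matching the Gaussian MLE of the variance

# Estimated marginal density on a coarse grid.
grid = np.linspace(-4.0, 4.0, 9)
g_hat = normal_pdf(grid, mu_hat, sigma2_hat)
print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
print(np.round(g_hat, 4))
```

The fit is only as good as the normality assumption: if the true marginal density is skewed or heavy-tailed, the plug-in normal density will be systematically off no matter how large $T$ is.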