
Introduction to Nonparametric Analysis in Time Series Econometrics
Yongmiao Hong
2020

This is Chapter 6 of a manuscript entitled Modern Time Series Analysis: Theory and Applications, written by the author. We will introduce some popular nonparametric methods, particularly the kernel smoothing method and the local polynomial smoothing method, to estimate functions of interest in time series contexts, such as probability density functions, autoregression functions, spectral density functions, and generalized spectral density functions. Empirical applications of these functions depend crucially on their consistent estimation. We will discuss the large sample statistical properties of nonparametric estimators in various contexts.

Key words: Asymptotic normality, bias, boundary problem, consistency, curse of dimensionality, density function, generalized spectral density, global smoothing, integrated mean squared error, law of large numbers, local polynomial smoothing, local smoothing, locally stationary time series model, mean squared error, kernel method, regression function, series approximation, smoothing, spectral density function, Taylor series expansion, variance.

Reading Materials and References

This lecture note is self-contained. However, the following references will be useful for learning nonparametric analysis.

(1) Nonparametric Analysis in Time Domain
• Silverman, B. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall: London.
• Härdle, W. (1990), Applied Nonparametric Regression, Cambridge University Press: Cambridge.
• Fan, J. and Q. Yao (2003), Nonlinear Time Series: Parametric and Nonparametric Methods, Springer: New York.

(2) Nonparametric Methods in Frequency Domain
• Priestley, M. (1981), Spectral Analysis and Time Series, Academic Press: New York.
• Hannan, E. (1970), Multiple Time Series, John Wiley: New York.

1 Motivation

Suppose $\{X_t\}$ is a strictly stationary process with marginal probability density function $g(x)$ and pairwise joint probability density function $f_j(x, y)$, and a random sample $\{X_t\}_{t=1}^{T}$ of size $T$ is observed. Then,

• How to estimate the marginal pdf $g(x)$ of $\{X_t\}$?
• How to estimate the pairwise joint pdf $f_j(x, y)$ of $(X_t, X_{t-j})$?
• How to estimate the autoregression function $r_j(x) = E(X_t \mid X_{t-j} = x)$?
• How to estimate the spectral density $h(\omega)$ of $\{X_t\}$?
• How to estimate the generalized spectral density $f(\omega, u, v)$ of $\{X_t\}$?
• How to estimate the bispectral density $b(\omega_1, \omega_2)$?
• How to estimate a nonlinear autoregressive conditional heteroskedastic model
\[ X_t = \mu(X_{t-1}, \ldots, X_{t-p}) + \sigma(X_{t-1}, \ldots, X_{t-q})\,\varepsilon_t, \qquad \{\varepsilon_t\} \sim \text{i.i.d.}(0, 1), \]
where $\mu(\cdot)$ and $\sigma(\cdot)$ are unknown functions of the past information? Under certain regularity conditions, $\mu(\cdot)$ is the conditional mean of $X_t$ given $I_{t-1} = \{X_{t-1}, X_{t-2}, \ldots\}$ and $\sigma^2(\cdot)$ is the conditional variance of $X_t$ given $I_{t-1}$.
• How to estimate a semi-nonparametric functional coefficient autoregressive process
\[ X_t = \sum_{j=1}^{p} \alpha_j(X_{t-d})\,X_{t-j} + \varepsilon_t, \qquad E(\varepsilon_t \mid I_{t-1}) = 0 \text{ a.s.}, \]
where $\alpha_j(\cdot)$ is unknown, and $d > 0$ is a time lag parameter?
• How to estimate a nonparametric additive autoregressive process
\[ X_t = \sum_{j=1}^{p} \mu_j(X_{t-j}) + \varepsilon_t, \qquad E(\varepsilon_t \mid I_{t-1}) = 0 \text{ a.s.}, \]
where the $\mu_j(\cdot)$ functions are unknown?
• How to estimate a locally linear time-varying regression model
\[ Y_t = X_t'\beta(t/T) + \varepsilon_t, \]
where $\beta(\cdot)$ is an unknown smooth deterministic function of time?

• How to use these estimators in economic and financial applications?

Nonparametric estimation is often called nonparametric smoothing, since a key parameter, called the smoothing parameter, is used to control the degree of smoothness of the estimated curve. Nonparametric smoothing first arose from spectral density estimation in time series analysis. In a discussion of the seminal paper by Bartlett (1946), Henry Daniels suggested that spectral density estimation could be improved by smoothing the periodogram (see Chapter 3), which is the squared modulus of the discrete Fourier transform of the random sample $\{X_t\}_{t=1}^{T}$. The theory and techniques were then systematically developed by Bartlett (1948, 1950). Thus, smoothing techniques were already prominently featured in time series analysis more than 70 years ago.

In the earlier stage of nonlinear time series analysis (see Tong (1990)), the focus was on various nonlinear parametric forms, such as threshold autoregressive models, smooth transition autoregressive models, and Markov chain regime-switching autoregressive models (see Chapter 8 for details). Recent interest has been mainly in nonparametric curve estimation, which does not require knowledge of the functional form beyond certain smoothness conditions on the underlying function of interest.

Question: Why is nonparametric smoothing popular in statistics and econometrics?

There are several reasons for the popularity of nonparametric analysis. In particular, three main reasons are:

• Demands for nonlinear approaches;
• Availability of large data sets;
• Advances in computer technology.

Indeed, as Granger (1999) points out, the speed of computing technology increases much faster than the speed at which data grows.

To obtain basic ideas about nonparametric smoothing methods, we now consider two examples: one is the estimation of a regression function, and the other is the estimation of a probability density function.

Example 1 [Regression Function]: Consider the first order autoregression function
\[ r_1(x) = E(X_t \mid X_{t-1} = x). \]
We can write
\[ X_t = r_1(X_{t-1}) + \varepsilon_t, \]
where $E(\varepsilon_t \mid X_{t-1}) = 0$ by construction. We assume $E(X_t^2) < \infty$. [...]

For another example, suppose the regression function is a step function, namely
\[ r_1(x) = \begin{cases} -1 & \text{if } -\pi < x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } 0 < x < \pi. \end{cases} \]
Then we can still expand it as an infinite sum of periodic series,
\[ r_1(x) = \frac{4}{\pi}\left[\sin(x) + \frac{\sin(3x)}{3} + \frac{\sin(5x)}{5} + \cdots\right] = \frac{4}{\pi}\sum_{j=0}^{\infty} \frac{\sin[(2j+1)x]}{2j+1}. \]
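The convergence of the sine series for the step function can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `step_function` and `fourier_partial_sum`, the truncation orders, and the evaluation grid (kept at least 0.5 away from the jumps at $0$ and $\pm\pi$) are all illustrative assumptions.

```python
import numpy as np

def step_function(x):
    """Step function: -1 on (-pi, 0), 0 at 0, and 1 on (0, pi)."""
    return np.sign(x)

def fourier_partial_sum(x, p):
    """Partial sum (4/pi) * sum_{j=0}^{p} sin((2j+1)x) / (2j+1)."""
    x = np.asarray(x, dtype=float)
    k = 2 * np.arange(p + 1) + 1          # odd frequencies 1, 3, 5, ...
    return (4.0 / np.pi) * np.sum(np.sin(np.outer(x, k)) / k, axis=1)

# Evaluate away from the discontinuities, where convergence is slow.
grid = np.linspace(-np.pi + 0.5, np.pi - 0.5, 401)
grid = grid[np.abs(grid) > 0.5]

errors = {}
for p in (1, 5, 50, 500):
    approx = fourier_partial_sum(grid, p)
    errors[p] = np.max(np.abs(approx - step_function(grid)))
    print(f"p = {p:4d}   max abs error on grid = {errors[p]:.5f}")
```

The maximum error on the grid shrinks as the truncation order grows, which is the sense in which a truncated series can approximate the function well away from the jump points.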

In general, we do not assume that the functional form of $r_1(x)$ is known, except that we still maintain the assumption that $r_1(x)$ is a square-integrable function. Because $r_1(x)$ is square-integrable, we have
\[
\int_{-\infty}^{\infty} r_1^2(x)\,dx
= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j\alpha_k \int_{-\infty}^{\infty} \psi_j(x)\psi_k(x)\,dx
= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j\alpha_k\,\delta_{j,k} \quad \text{(by orthonormality)}
= \sum_{j=0}^{\infty} \alpha_j^2 < \infty,
\]
where $\delta_{j,k}$ is the Kronecker delta function: $\delta_{j,k} = 1$ if $j = k$ and $0$ otherwise.

The square summability implies $\alpha_j \to 0$ as $j \to \infty$; that is, $\alpha_j$ becomes less important as the order $j \to \infty$. This suggests that a truncated sum
\[ r_{1p}(x) = \sum_{j=0}^{p} \alpha_j\psi_j(x) \]
can be used to approximate $r_1(x)$ arbitrarily well if $p$ is sufficiently large. The approximation error, or the bias,
\[ b_p(x) \equiv r_1(x) - r_{1p}(x) = \sum_{j=p+1}^{\infty} \alpha_j\psi_j(x) \to 0 \]
as $p \to \infty$.

However, the coefficient $\alpha_j$ is unknown. To obtain a feasible estimator for $r_1(x)$, we consider the following sequence of truncated regression models
\[ X_t = \sum_{j=0}^{p} \alpha_j\psi_j(X_{t-1}) + \varepsilon_{pt}, \]
where $p \equiv p(T) \to \infty$ is the number of series terms, which depends on the sample size $T$. We need $p/T \to 0$ as $T \to \infty$, i.e., the number of series terms $p$ is much smaller than the sample size $T$. Note that the regression error $\varepsilon_{pt}$ is not the same as the true innovation $\varepsilon_t$ for each given $p$. Instead, it contains the true innovation $\varepsilon_t$ and the bias $b_p(X_{t-1})$; that is, $\varepsilon_{pt} = \varepsilon_t + b_p(X_{t-1})$.

The ordinary least squares estimator is
\[ \hat{\alpha} = (\Psi'\Psi)^{-1}\Psi'X = \left(\sum_{t=2}^{T} \psi_t\psi_t'\right)^{-1} \sum_{t=2}^{T} \psi_t X_t, \]
where $\Psi = (\psi_2, \ldots, \psi_T)'$ is a $(T-1) \times (p+1)$ matrix, $X = (X_2, \ldots, X_T)'$, and $\psi_t = [\psi_0(X_{t-1}), \psi_1(X_{t-1}), \ldots, \psi_p(X_{t-1})]'$ is a $(p+1) \times 1$ vector. The series-based regression estimator is
\[ \hat{r}_{1p}(x) = \sum_{j=0}^{p} \hat{\alpha}_j\psi_j(x). \]

To ensure that $\hat{r}_{1p}(x)$ is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$ (e.g., $p = \sqrt{T}$). However, if $p$ is too large, the number of estimated parameters will be too large, and as a consequence, the sampling variation of $\hat{\alpha}$ will be large (i.e., the estimator $\hat{\alpha}$ is imprecise). We must choose an appropriate $p = p(T)$ so as to balance the bias and the sampling variation. The truncation order $p$ is called a smoothing parameter because it controls the smoothness of the estimated function $\hat{r}_{1p}(x)$. In general, for any given sample, a small $p$ will give a smooth estimated curve whereas a large $p$ will give a wiggly estimated curve. If $p$ is too large such that the variance of $\hat{r}_{1p}(x)$ is larger than its squared bias, we say that there is undersmoothing. In contrast, if $p$ is too small such that the variance of $\hat{r}_{1p}(x)$ is smaller than its squared bias, we say that there is oversmoothing. Optimal smoothing is achieved when the variance of $\hat{r}_{1p}(x)$ balances its squared bias. The series estimator $\hat{r}_{1p}(x)$ is called a global smoothing method, because once $p$ is given, the estimated function $\hat{r}_{1p}(x)$ is determined over the entire domain of $X_t$.

Under suitable regularity conditions, $\hat{r}_{1p}(x)$ will consistently estimate the unknown function $r_1(x)$ as the sample size $T$ increases. This is called nonparametric estimation because no parametric functional form is imposed on $r_1(x)$.

The base functions $\{\psi_j(\cdot)\}$ can be the Fourier series (i.e., the sine and cosine functions), or B-spline functions if $X_t$ has a bounded support. See, e.g., Andrews (1991, Econometrica) and Hong and White (1995, Econometrica) for applications.
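The series regression estimator can be sketched in a few lines of code. What follows is an illustration under stated assumptions rather than the author's implementation: the data generating process $X_t = 0.8\sin(X_{t-1}) + 0.3\varepsilon_t$, the cosine basis on the rescaled sample range, the choice $p = 5$, and the names `cosine_basis` and `series_regression` are all hypothetical choices made for this example.

```python
import numpy as np

def cosine_basis(x, p, lo, hi):
    """Matrix with rows [psi_0(x), ..., psi_p(x)]: cosine basis on [lo, hi].

    After rescaling to u in [0, 1], psi_0 = 1 and psi_j = sqrt(2) cos(j pi u),
    which are orthonormal on [0, 1].
    """
    u = (np.asarray(x, dtype=float) - lo) / (hi - lo)
    psi = np.sqrt(2.0) * np.cos(np.pi * np.outer(u, np.arange(p + 1)))
    psi[:, 0] = 1.0
    return psi

def series_regression(x, p):
    """OLS of X_t on psi_0(X_{t-1}), ..., psi_p(X_{t-1}) for t = 2, ..., T."""
    y, lag = x[1:], x[:-1]
    lo, hi = lag.min(), lag.max()
    Psi = cosine_basis(lag, p, lo, hi)              # (T-1) x (p+1) design matrix
    alpha_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)

    def r_hat(x0):
        # Estimated autoregression function evaluated at the points x0.
        return cosine_basis(x0, p, lo, hi) @ alpha_hat

    return alpha_hat, r_hat

# Simulate an assumed nonlinear AR(1) process with r_1(x) = 0.8 sin(x).
rng = np.random.default_rng(0)
T = 5000
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * np.sin(x[t - 1]) + 0.3 * rng.standard_normal()

alpha_hat, r_hat = series_regression(x, p=5)
grid = np.linspace(-0.8, 0.8, 81)                   # interior of the support
mae = np.mean(np.abs(r_hat(grid) - 0.8 * np.sin(grid)))
print("number of coefficients:", alpha_hat.size)
print("mean abs error on grid:", round(float(mae), 4))
```

Rerunning the sketch with a much larger `p` at the same `T` is one way to see the sampling variation of the coefficients grow, which is the trade-off that the choice of the smoothing parameter has to balance.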

Example 2 [Probability Density Function]: Suppose the PDF $g(x)$ of $X_t$ is a smooth function with unbounded support. We can expand
\[ g(x) = \phi(x)\sum_{j=0}^{\infty} \beta_j H_j(x), \]
where the function
\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right) \]
is the $N(0, 1)$ density function, and $\{H_j(x)\}$ is the sequence of Hermite polynomials, defined as
\[ (-1)^j \frac{d^j}{dx^j}\Phi(x) = -H_{j-1}(x)\phi(x) \quad \text{for } j > 0, \]
where $\Phi(\cdot)$ is the $N(0, 1)$ CDF. For example,
\[ H_0(x) = 1, \quad H_1(x) = x, \quad H_2(x) = x^2 - 1, \quad H_3(x) = x(x^2 - 3), \quad H_4(x) = x^4 - 6x^2 + 3. \]
See, for example, Magnus, Oberhettinger and Soni (1966, Section 5.6) and Abramowitz and Stegun (1972, Ch. 22). Here, the Fourier coefficient
\[ \beta_j = \int_{-\infty}^{\infty} g(x)H_j(x)\phi(x)\,dx. \]
Again, $\beta_j \to 0$ as $j \to \infty$ given $\sum_{j=0}^{\infty} \beta_j^2 < \infty$.

The $N(0, 1)$ PDF $\phi(x)$ is the leading term to approximate the unknown density $g(x)$, and the Hermite polynomial series will capture departures from normality (e.g., skewness and heavy tails).

To estimate $g(x)$, we can consider the sequence of truncated probability densities
\[ g_p(x) = C_p^{-1}\phi(x)\sum_{j=0}^{p} \beta_j H_j(x), \]
where the constant
\[ C_p = \sum_{j=0}^{p} \beta_j \int_{-\infty}^{\infty} H_j(x)\phi(x)\,dx \]

is a normalization factor to ensure that $g_p(x)$ is a PDF for each $p$. The unknown parameters $\{\beta_j\}$ can be estimated from the sample $\{X_t\}_{t=1}^{T}$ via the maximum likelihood estimation (MLE) method. For example, suppose $\{X_t\}$ is an IID sample. Then
\[ \hat{\beta} = \arg\max_{\beta} \sum_{t=1}^{T} \ln g_p(X_t). \]

To ensure that
\[ \hat{g}_p(x) = \hat{C}_p^{-1}\phi(x)\sum_{j=0}^{p} \hat{\beta}_j H_j(x) \]
is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$. However, $p$ must grow more slowly than the sample size $T$ so that the sampling variation of $\hat{\beta}$ will not be too large.

For the use of Hermite polynomial series expansions, see, e.g., Gallant and Tauchen (1996, Econometric Theory), Aït-Sahalia (2002, Econometrica), and Cui, Hong and Li (2020).

Question: What are the advantages of nonparametric smoothing methods?

They require few assumptions or restrictions on the data generating process. In particular, they do not assume a specific functional form for the function of interest (of course, certain smoothness conditions such as differentiability are required). They can deliver a consistent estimator for the unknown function, no matter whether it is linear or nonlinear. Thus, nonparametric methods can effectively reduce potential systematic biases due to model misspecification, which is more likely to be encountered in parametric modeling.

Question: What are the disadvantages of nonparametric methods?

• Nonparametric methods require a large data set for reasonable estimation. Furthermore, there exists a notorious problem of the "curse of dimensionality" when the function of interest contains multiple explanatory variables. This will be explained below.
• There exists another notorious "boundary effect" problem for nonparametric estimation near the boundary regions of the support. This occurs due to asymmetric coverage of data in the boundary regions.

• Coefficients are usually difficult to interpret from an economic point of view.
• There exists a danger of potential overfitting, in the sense that a nonparametric method, due to its flexibility, tends to capture non-essential features of the data which will not appear in out-of-sample scenarios.

The above two motivating examples are so-called orthogonal series expansion methods. There are other nonparametric methods, such as spline smoothing, kernel smoothing, k-nearest neighbor, and local polynomial smoothing. As mentioned earlier, series expansion methods are examples of so-called global smoothing, because the coefficients are estimated using all observations, and they are then used to evaluate the values of the underlying function over all points in the support of $X_t$. A nonparametric series model is an increasing sequence of parametric models, as the sample size $T$ grows. In this sense, it is also called a sieve estimator. In contrast, kernel and local polynomial methods are examples of so-called local smoothing methods, because estimation only requires the observations in a neighborhood of the point of interest. Below we will mainly focus on kernel and local polynomial smoothing methods, due to their simplicity and intuitive nature.

2 Kernel Density Method

2.1 Univariate Density Estimation

Suppose $\{X_t\}$ is a strictly stationary time series process with unknown marginal PDF $g(x)$.

Question: How to estimate the marginal PDF $g(x)$ of the time series process $\{X_t\}$?

We first consider a parametric approach. Assume that $g(x)$ is an $N(\mu, \sigma^2)$ PDF with unknown $\mu$ and $\sigma^2$. Then we know the functional form of $g(x)$ up to two unknown parameters $\theta = (\mu, \sigma^2)'$:
\[ g(x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right], \qquad -\infty < x < \infty. \]
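As a small illustration of the parametric approach, the sketch below evaluates the normal density $g(x, \theta)$ at estimated parameter values. It rests on assumptions that are not in the text above: the sample is simulated as i.i.d. $N(1, 4)$ for simplicity, $\theta$ is estimated by the sample mean and the sample variance with divisor $T$ (the Gaussian maximum likelihood estimates under an i.i.d. sample), and the names `normal_pdf` and `fit_normal` are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(mu, sigma2) density g(x, theta) with theta = (mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def fit_normal(sample):
    """Sample mean and sample variance (divisor T) as estimates of theta."""
    sample = np.asarray(sample, dtype=float)
    return sample.mean(), sample.var()

# Assumed illustrative data: an i.i.d. N(1, 4) sample of size 5000.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=5000)

mu_hat, sigma2_hat = fit_normal(sample)
half_width = 8.0 * np.sqrt(sigma2_hat)
grid = np.linspace(mu_hat - half_width, mu_hat + half_width, 2001)
density = normal_pdf(grid, mu_hat, sigma2_hat)
area = np.sum(density) * (grid[1] - grid[0])        # Riemann sum, close to 1

print("mu_hat =", round(float(mu_hat), 3), " sigma2_hat =", round(float(sigma2_hat), 3))
print("area under fitted density =", round(float(area), 4))
```

The fitted curve is only as good as the normality assumption: if the true $g(x)$ is skewed or heavy-tailed, no choice of $\theta$ will recover it.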