Statistical Learning Theory and Applications
Lecture 3: Regression Models
Instructor: Quan Wen
SCSE@UESTC, Fall 2021
Outline (Level 1)
1. A Case
2. Least Squares Method
3. From Linear to Nonlinear: Using Linear Model
4. How Regression Got Its Name
5. Probability Interpretation
6. Bias–Variance Dilemma
Topics:
• Basic theoretical concepts, properties, and calculations of regression analysis
• Derivation and calculation of least squares
• Probability interpretation of regression analysis
• Regression analysis of nonlinear functions
• Bias–variance dilemma of regression analysis

Key points and difficulties:
• Key points: derivation and calculation of least squares
• Difficulties: probability interpretation of regression analysis
1. A Case

Investigate the trend of housing prices, with the following data:

Year   Area (m²)   Price (10k$)
1999          70              6
2000          60              6
2001         120             20
2002         125             26
 ...         ...            ...

We usually expect to use such data to predict the future trend of house prices.
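As a rough illustration of fitting such data, the sketch below runs an ordinary least-squares fit of price against area for the four rows above; treating area as the only feature (and adding an intercept) is an assumption made purely for illustration:

```python
import numpy as np

# Housing data from the table above (assumed feature: area in m^2).
area = np.array([70.0, 60.0, 120.0, 125.0])
price = np.array([6.0, 6.0, 20.0, 26.0])  # in 10k$

# Design matrix with a constant column for the intercept.
X = np.column_stack([np.ones_like(area), area])

# Ordinary least squares fit: price ~ w0 + w1 * area.
w, *_ = np.linalg.lstsq(X, price, rcond=None)
print(f"intercept = {w[0]:.3f}, slope = {w[1]:.3f}")

# Predict the price of a hypothetical 100 m^2 house.
print("predicted price for 100 m^2:", w[0] + w[1] * 100)
```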
Let $x = [x_1, x_2, \cdots, x_M]^T$ be a regressor, with each dimension representing a feature input. $d$ corresponds to an output of $x$. Their dependency can be expressed by a linear regression model as follows:
$$d = \sum_{i=1}^{M} w_i x_i + \varepsilon$$
1. $w_1, w_2, \cdots, w_M$: a set of fixed but unknown parameters.
2. $\varepsilon$: the expected error of the model. "Fixed" means that we assume the environment is stable and static.

Written in vector-matrix form:
$$d = w^T x + \varepsilon$$
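To make the model concrete, here is a minimal sketch that draws samples from $d = w^T x + \varepsilon$; the parameter values and the Gaussian form of $\varepsilon$ are illustrative assumptions, not part of the model definition:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 3, 5                           # feature dimension, number of samples
w_true = np.array([2.0, -1.0, 0.5])   # fixed but unknown parameters (illustrative)

X = rng.normal(size=(N, M))           # N regressors x^1, ..., x^N as rows
eps = rng.normal(scale=0.1, size=N)   # error term (assumed Gaussian for illustration)

d = X @ w_true + eps                  # d^i = w^T x^i + eps_i for each sample
print(d)
```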
Outline (Level 2)
2. Least Squares Method
   • Numeric Approach
   • Analytic Approach
2. Least Squares Method
2.1. Numeric Approach

Assuming a training set $\Omega = \{(x^1, d^1), (x^2, d^2), \cdots, (x^N, d^N)\}$, define the following cost function:
$$J_\Omega(w) = \frac{1}{2} \sum_{i=1}^{N} \varepsilon_i^2(w) = \frac{1}{2} \sum_{i=1}^{N} \left(d^i - w^T x^i\right)^2$$

Through the gradient descent algorithm, we can obtain $w$:
$$w_{t+1} = w_t - \eta \frac{\partial}{\partial w} J_\Omega(w_t)$$
• $\eta$: step size (called the learning rate in machine learning)
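A minimal sketch of this numeric approach, assuming the batch gradient $\frac{\partial}{\partial w} J_\Omega(w) = \sum_{i=1}^{N} (w^T x^i - d^i)\, x^i$ (the single-sample form is derived on the next slide) and an illustrative fixed step size $\eta$:

```python
import numpy as np

def least_squares_gd(X, d, eta=0.01, steps=1000):
    """Minimize J(w) = 0.5 * sum_i (d_i - w^T x_i)^2 by gradient descent.

    X: (N, M) matrix whose rows are the regressors x^i.
    d: (N,) vector of outputs d^i.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - d      # (w^T x^i - d^i) for all i
        grad = X.T @ residual     # sum_i (w^T x^i - d^i) * x^i
        w -= eta * grad           # w_{t+1} = w_t - eta * dJ/dw
    return w

# Toy check on data generated from a known linear model.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_true = np.array([1.5, -0.7])
d = X @ w_true
print(least_squares_gd(X, d))  # should approach w_true
```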
The gradient descent algorithm is based on the following observation: if the real-valued function $F(x)$ is differentiable and defined at $a$, then $F(x)$ descends fastest along $-\nabla F(a)$, the direction opposite to the gradient at $a$.

If $\Omega$ has only one sample:
$$\frac{\partial}{\partial w} J_\Omega(w) = \frac{1}{2} \times \frac{\partial}{\partial w}\left(w^T x - d\right) \times 2 \times \left(w^T x - d\right) = x\left(w^T x - d\right) = \underbrace{\left(w^T x - d\right)}_{\text{scalar}}\, x$$
using the denominator layout (Hessian formulation) for the gradient.
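As a quick sanity check on this derivation, the sketch below compares the closed-form single-sample gradient $(w^T x - d)\,x$ against a numerical finite-difference gradient of $J(w) = \frac{1}{2}(d - w^T x)^2$; all values here are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4)
d = 1.0
w = rng.normal(size=4)

# Single-sample cost J(w) = 0.5 * (d - w^T x)^2.
J = lambda w: 0.5 * (d - w @ x) ** 2

# Closed-form gradient from the derivation above.
grad_analytic = (w @ x - d) * x

# Central finite-difference approximation, coordinate by coordinate.
h = 1e-6
grad_numeric = np.array([
    (J(w + h * e) - J(w - h * e)) / (2 * h)
    for e in np.eye(4)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # expect True
```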