Lecture 8: Simple Linear Regression

Simple Linear Regression (Bivariate Regression)
We have already looked at measuring relationships between two interval variables using correlation. Now we continue the bivariate analysis of the two variables using regression analysis. The purpose of doing regression rather than correlation, however, is that we can predict results in one variable based on another variable. So, rather than simply seeing whether the variables are related, we can interpret their effect.

Simple Linear Regression
Like correlation, there are two major assumptions:
• The relationship should be linear; and
• The level of data must be continuous.

The regression equation
The purpose of simple linear regression is to fit a line to the two variables. This line is called the line of best fit, or the regression line. When we do a scatterplot of two variables, it is possible to fit a line which best represents the data.
The regression equation
A regression equation is used to define the relationship between two variables. It takes the form

$Y = a + bX$  or  $Y = \beta_0 + \beta_1 X_1 + \varepsilon$

They are essentially the same, except that the second includes an error term at the end. This error term indicates that what we have is in fact a model, and hence won't fit the data perfectly.

The regression equation
These terms represent the following:
• $a$ (or $\beta_0$) = the constant. It is the value at which the line intersects the Y axis.
• $b$ (or $\beta_1$) = the slope (or gradient) of the line. It represents the change in Y for each one-unit increase or decrease in X.
• $X$ = the value of the X variable for each case.

Scatterplot and regression line
[Figure: scatterplot with a fitted regression line. When X changes by 1, Y changes by 10; the intercept is 20.]
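To see how the intercept and slope work together, here is a minimal Python sketch (not from the lecture) that evaluates the regression equation using the values shown in the scatterplot above, an intercept of 20 and a slope of 10:

```python
# Minimal sketch of evaluating the regression equation Y = a + bX,
# using the values from the scatterplot above (intercept 20, slope 10).

def predict_y(a: float, b: float, x: float) -> float:
    """Return the predicted Y for a given X on the line Y = a + bX."""
    return a + b * x

a, b = 20.0, 10.0           # intercept and slope read off the figure
for x in range(6):          # X = 0..5, as on the figure's horizontal axis
    print(x, predict_y(a, b, x))  # Y rises by 10 for each unit increase in X
```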
How do we fit a line to data?
In order to fit a line of best fit we use a method called the Method of Least Squares. This method allows us to determine which line, out of all the lines that could be drawn, best represents the least amount of difference between the actual values (the data points) and the predicted line.

[Figure: nine data points with a fitted line.]

In the figure above, three data points fall on the line, while the remaining 6 are slightly above or below the line. The differences between these points and the line are called residuals. Some of these differences will be positive (above the line), while others will be negative (below the line). If we add up all these differences, some of the positive and negative values will cancel each other out, which will have the effect of overestimating how well the line represents the data. Instead, if we square the differences and then add them up, we can work out which line has the smallest sum of squares (that is, the one with the least error).

How do we fit a line to data?
We do not have to test every possible line to see which fits the data best. The method of least squares provides the optimal values of $a$ (or $\beta_0$) and $b$ (or $\beta_1$). Once we have established them, we can use them in the regression equation. The formulas for calculating $b$ and $a$ are

$b = \dfrac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$  and  $a = \dfrac{\sum Y - b\sum X}{n}$
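To make the least-squares criterion concrete, here is a small illustrative sketch; the data points and the candidate lines are made up for this example and do not come from the lecture:

```python
# Illustrative sketch: comparing candidate lines by their sum of squared
# residuals. The (x, y) data points here are hypothetical.

data = [(1, 12), (2, 19), (3, 33), (4, 38), (5, 52)]

def sum_of_squares(a: float, b: float) -> float:
    """Sum of squared vertical differences between the data and Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in data)

# The least-squares line is the candidate with the smallest sum of squares.
for a, b in [(0.0, 10.0), (2.0, 9.0), (1.0, 10.0)]:
    print(f"a={a}, b={b}: sum of squares = {sum_of_squares(a, b):.1f}")
```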
Example 1
Data on age ($X$) and children ever born ($Y$, CEB) for five women give the totals $n = 5$, $\sum X = 149$, $\sum Y = 9$, $\sum XY = 299$, $\sum X^2 = 4803$, $\sum Y^2 = 19$, and $(\sum X)^2 = 149 \times 149 = 22201$.

Applying the formulas:

$b = \dfrac{5(299) - (149)(9)}{5(4803) - 22201} = \dfrac{154}{1814} = 0.085$

$a = \dfrac{9 - 0.085(149)}{5} = -0.73$

So the regression equation is $\hat{Y} = -0.73 + 0.085X$.

Prediction
We can use this equation to predict the number of children ever born based on age. So, if women in the community are aged 27, we would predict that their CEB number is

$\hat{Y} = -0.73 + 0.085 \times 27 = 1.56$

We could now draw a line of best fit through the observed data points.
[Figure: scatterplot of age (15 to 50) against number of children (0 to 3.5) with the fitted line.]
[SPSS output: Variables Entered/Removed and Model Summary tables. Predictors: (Constant), AGE.]
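As a check, here is a short sketch reproducing the Example 1 arithmetic from the summary totals:

```python
# Sketch reproducing the Example 1 hand calculation from the summary totals.

n = 5
sum_x, sum_y = 149, 9        # sum of ages, sum of children ever born
sum_xy, sum_x2 = 299, 4803   # sum of X*Y and sum of X squared

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"b = {b:.3f}")                     # about 0.085
print(f"a = {a:.2f}")                     # about -0.73
print(f"CEB at age 27 = {a + b*27:.2f}")  # about 1.56
```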
[SPSS output for Example 1: ANOVA table. Dependent variable: CEB.]

Inference for Regression
• When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to do tests and confidence intervals in this setting.

Example 2: Crying and IQ
• Crying easily in infancy may be a sign of higher IQ. Crying intensity and IQ data on 38 infants (IQ = intelligence quotient):
[Table: crying intensity and IQ for each of the 38 infants.]
• Plot and interpret. As always, we first examine the data. Figure 3 is a scatterplot of the crying data. Plot the explanatory variable (crying intensity at birth) horizontally and the response variable (IQ at age 3) vertically. Look for the form, direction, and strength of the relationship as well as for outliers or other deviations. There is a moderate positive linear relationship, with no extreme outliers or potentially influential observations.

[Figure 3: scatterplot of crying intensity against IQ.]

• Numerical summary. Because the scatterplot shows a roughly linear (straight-line) pattern, the correlation describes the direction and strength of the relationship. The correlation between crying and IQ is $r = 0.455$.

• Mathematical model. We are interested in predicting the response from information about the explanatory variable. So we find the least-squares regression line for predicting IQ from crying.

This line lies as close as possible to the points (in the sense of least squares) in the vertical (y) direction. The equation of the least-squares regression line is

$\hat{y} = a + bx = 91.27 + 1.493x$

Because $r^2 = 0.207$, about 21% of the variation in IQ scores is explained by crying intensity. See the SPSS output:
[SPSS output: coefficients table for (Constant) and CRYING. Dependent variable: IQ.]
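The lecture uses SPSS for this output. As a rough equivalent, here is a sketch using `scipy.stats.linregress`; the `crying` and `iq` arrays below are placeholder values, since the full 38-infant data set is not reproduced in these notes:

```python
# Sketch of obtaining the same quantities in Python rather than SPSS.
# `crying` and `iq` stand in for the 38-infant data set (placeholder values).
from scipy.stats import linregress

crying = [10, 12, 9, 16, 18, 15, 12, 20]
iq     = [87, 97, 103, 106, 109, 114, 119, 132]

res = linregress(crying, iq)
print(f"intercept a = {res.intercept:.2f}")  # 91.27 for the full data set
print(f"slope b     = {res.slope:.3f}")      # 1.493 for the full data set
print(f"r^2         = {res.rvalue**2:.3f}")  # 0.207 for the full data set
```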
The regression model
• The slope b and intercept a of the least-squares line are statistics. That is, we calculated them from the sample data. These statistics would take somewhat different values if we repeated the study with different infants. To do formal inference, we think of a and b as estimates of unknown parameters.

Assumptions for regression inference
We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.
• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other.
• The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \alpha + \beta x$. The slope $\beta$ and intercept $\alpha$ are unknown parameters.
• The standard deviation of y (call it $\sigma$) is the same for all values of x. The value of $\sigma$ is unknown.

The heart of this model is that there is an "on the average" straight-line relationship between y and x. The true regression line $\mu_y = \alpha + \beta x$ says that the mean response $\mu_y$ moves along a straight line as the explanatory variable x changes. We can't observe the true regression line. The values of y that we do observe vary about their means according to a normal distribution. If we hold x fixed and take many observations on y, the normal pattern will eventually appear in a histogram.
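One way to internalize this model is to simulate it. The sketch below is not part of the lecture and its parameter values are arbitrary; it draws many responses y at a fixed x and shows that they cluster normally around the true mean $\alpha + \beta x$:

```python
# Simulation sketch of the regression model: for fixed x, y varies normally
# around the true line mu_y = alpha + beta*x with constant sigma.
import random

alpha, beta, sigma = 91.0, 1.5, 17.0  # arbitrary illustrative parameters

def draw_y(x: float) -> float:
    """One observed response at a fixed x: normal, mean alpha + beta*x, sd sigma."""
    return random.gauss(alpha + beta * x, sigma)

x = 15.0
ys = [draw_y(x) for _ in range(10_000)]
print(f"true mean response at x={x}: {alpha + beta * x:.1f}")
print(f"sample mean of 10,000 draws: {sum(ys) / len(ys):.1f}")
```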
In practice, we observe y for many different values of x, so that we see an overall linear pattern formed by points scattered about the true line. The standard deviation $\sigma$ determines whether the points fall close to the true regression line (small $\sigma$) or are widely scattered (large $\sigma$).

• Figure 4 shows the regression model in picture form. The line in the figure is the true regression line. The mean of the response y moves along this line as the explanatory variable x takes different values. The normal curves show how y will vary when x is held fixed at different values. All of the curves have the same $\sigma$, so the variability of y is the same for all values of x. You should check the assumptions for inference when you do inference about regression.

Figure 4: The regression model. The line is the true regression line, which shows how the mean response $\mu_y$ changes as the explanatory variable x changes. For any fixed value of x, the observed response y varies according to a normal distribution having mean $\mu_y$.

Inference about the Model
• The first step in inference is to estimate the unknown parameters $\alpha$, $\beta$, and $\sigma$. When the regression model describes our data and we calculate the least-squares line $\hat{y} = a + bx$, the slope b of the least-squares line is an unbiased estimator of the true slope $\beta$, and the intercept a of the least-squares line is an unbiased estimator of the true intercept $\alpha$.
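A small simulation can illustrate what "unbiased" means here: across many samples drawn from the model, the least-squares slope b averages out to the true $\beta$. The parameter values below are arbitrary illustrative choices:

```python
# Sketch of unbiasedness: over many repeated samples from the model,
# the least-squares slope b averages out to the true slope beta.
import random

alpha, beta, sigma = 91.0, 1.5, 17.0
xs = [9, 10, 12, 14, 15, 16, 18, 20, 22, 23]  # fixed x values

def least_squares_slope(xs, ys):
    """Slope b from the lecture's formula: (n*Sxy - Sx*Sy) / (n*Sx2 - Sx**2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sx2 - sx ** 2)

slopes = []
for _ in range(5_000):
    ys = [random.gauss(alpha + beta * x, sigma) for x in xs]
    slopes.append(least_squares_slope(xs, ys))

print(f"true beta = {beta}")
print(f"average b over 5,000 samples = {sum(slopes) / len(slopes):.3f}")
```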
• The data in Figure 3 fit the regression model of scatter about an invisible true regression line reasonably well. The least-squares line is $\hat{y} = 91.27 + 1.493x$. The slope is particularly important. A slope is a rate of change. The true slope $\beta$ says how much higher average IQ is for children with one more intensity unit in their crying measurement. Because b = 1.493 estimates the unknown $\beta$, we estimate that on the average IQ is about 1.5 points higher for each added unit of crying intensity.

• We need the intercept a = 91.27 to draw the line, but it has no statistical meaning in this example. No child had a crying intensity below 9, so we have no data near x = 0.

• The remaining parameter of the model is the standard deviation $\sigma$, which describes the variability of the response y about the true regression line. The least-squares line estimates the true regression line. So the residuals estimate how much y varies about the true line.

• Recall that the residuals are the vertical deviations of the data points from the least-squares line:

residual = observed y − predicted y = $y - \hat{y}$

There are n residuals, one for each data point. Because $\sigma$ is the standard deviation of responses about the true regression line, we estimate it by a sample standard deviation of the residuals. We call this sample standard deviation a standard error to emphasize that it is estimated from data. The residuals from a least-squares line always have mean zero. That simplifies their standard error.

Standard error about the least-squares line
• The standard error about the line is

$s = \sqrt{\dfrac{1}{n-2}\sum \text{residual}^2} = \sqrt{\dfrac{1}{n-2}\sum (y - \hat{y})^2}$

Use s to estimate the unknown $\sigma$ in the regression model.
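A direct implementation of this formula, as a minimal sketch:

```python
# Direct implementation of the standard error about the line,
# s = sqrt( sum(residual^2) / (n - 2) ), as defined above.
from math import sqrt

def standard_error_about_line(residuals):
    """Estimate sigma: standard deviation of the residuals with n - 2 df."""
    n = len(residuals)
    return sqrt(sum(r * r for r in residuals) / (n - 2))
```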
• Because we use the standard error about the line so often in regression inference, we just call it s. Notice that $s^2$ is an average of the squared deviations of the data points from the line, so it qualifies as a variance. We average the squared deviations by dividing by n − 2, the number of data points less 2. It turns out that if we know n − 2 of the n residuals, the other two are determined. That is, n − 2 is the degrees of freedom of s. We first met the idea of degrees of freedom in the case of the ordinary sample standard deviation of n observations, which has n − 1 degrees of freedom. Now we observe two variables rather than one, and the proper degrees of freedom is n − 2 rather than n − 1.

• Calculating s is unpleasant. You must find the predicted response for each x in our data set, then the residuals, and then s. In practice we will use SPSS, which does this arithmetic instantly. Nonetheless, here is an example to help you understand the standard error s.

Example 2 (continued)
• The first infant had a crying intensity of 10 and a later IQ of 87. The predicted IQ for x = 10 is

$\hat{y} = 91.27 + 1.493x = 91.27 + 1.493 \times 10 = 106.2$

The residual for this observation is

residual = $y - \hat{y}$ = 87 − 106.2 = −19.2

That is, the observed IQ for this infant lies 19.2 points below the least-squares line.

• Repeat this calculation 37 more times, once for each subject. The 38 residuals are:

-19.20 -31.13 -22.65 -15.18
-12.18 -15.15 -16.63 -6.18
-1.70 -22.60 -6.68 -6.17
-9.15 -23.58 -9.14 2.80
-9.14 -1.66 -6.14 -12.60
0.34 -8.62 2.85 14.30
9.82 10.82 0.37 8.85
10.87 19.34 10.89 -2.55
20.85 24.35 18.94 32.89
18.47 51.32
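Applying the standard-error formula to the 38 residuals listed above, as a self-contained sketch (in practice SPSS reports this quantity directly):

```python
# Computing s from the 38 residuals listed above.
from math import sqrt

residuals = [
    -19.20, -31.13, -22.65, -15.18, -12.18, -15.15, -16.63, -6.18,
    -1.70, -22.60, -6.68, -6.17, -9.15, -23.58, -9.14, 2.80,
    -9.14, -1.66, -6.14, -12.60, 0.34, -8.62, 2.85, 14.30,
    9.82, 10.82, 0.37, 8.85, 10.87, 19.34, 10.89, -2.55,
    20.85, 24.35, 18.94, 32.89, 18.47, 51.32,
]

n = len(residuals)                                 # 38 data points
s = sqrt(sum(r * r for r in residuals) / (n - 2))  # divide by n - 2 = 36 df
print(f"s = {s:.2f}")
```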