Lecture 10: Logistic Regression

Why use logistic regression?

In linear regression:

Y = b0 + b1X1 + b2X2 + … + bnXn + e

the dependent variable, Y, is continuous and unbounded, and we want to identify a set of explanatory (independent, or X) variables that will assist us in predicting its mean value while explaining its observed variability.

In many situations in demography and the social sciences, however, we have a dependent variable, Y, that is dichotomous rather than continuous, e.g., whether or not a woman has had a second birth, whether the second birth is a male or a female baby, whether or not a woman uses any contraceptive method, whether or not a person has migrated in the last 5 years, whether or not a member of the People's University staff uses public transportation to come to work, whether or not an undergraduate student who completed study last year in the Demography Department at People's University was awarded a BA degree, etc.

In all these situations the outcome Y assumes only two forms; usually, the value 1 represents "yes," or a "success," and the value 0, "no," or a "failure." The mean of this dichotomous (also referred to as binary) dependent variable, designated p, is the proportion of times that it takes the value 1.
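The last point can be checked directly: the mean of a 0/1 variable is the proportion of 1s. A minimal sketch in Python, using hypothetical data (the values below are illustrative, not from the survey):

```python
# Minimal sketch: the mean of a dichotomous (0/1) variable equals the
# proportion of cases taking the value 1. The data here are hypothetical.
y = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1]   # 1 = "success", 0 = "failure"

mean_y = sum(y) / len(y)
proportion_ones = y.count(1) / len(y)

print(mean_y)                      # 0.5
assert mean_y == proportion_ones   # the mean is the proportion p
```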
Example: Obtaining Abortion

• The data in the following tables were derived from the 1997 survey, which contains information on abortion use and associated information. The tables and the chart show the incidence of abortion according to the number of pregnancies. It can be seen that the proportion of women obtaining abortion increases rapidly, from a very low proportion at the first pregnancy to a large proportion among women having 5 or more pregnancies.

PREG5 * whether abortion Crosstabulation (counts)

PREG5     no     yes    Total
1         880      9      889
2         950    418     1368
3         499    461      960
4         230    271      501
5         102    193      295
Total    2661   1352     4013

PREG5 * whether abortion Crosstabulation (% within PREG5)

PREG5      no      yes     Total
1.00     99.0%    1.0%   100.0%
2.00     69.4%   30.6%   100.0%
3.00     52.0%   48.0%   100.0%
4.00     45.9%   54.1%   100.0%
5.00     34.6%   65.4%   100.0%
Total    66.3%   33.7%   100.0%

[Chart: mean of 'whether abortion' (0.0 to 0.7) by PREG5, 1 to 5]

• To make a statistical model of this relationship, we could feasibly fit a linear regression line to the cases, with pregnancy number as the explanatory variable and a dichotomous dependent variable (0 = not having abortion, 1 = having abortion). There are two main problems with this approach.

• The first problem is that it is possible, and indeed happens in this case, that the fitted regression line will cross below zero and/or above one right in the range where we do not want that to occur. The fitted regression line can be shown to have the form

p = -0.01314 + 0.13798*PREG

where p is the proportion having abortion and PREG is the number of pregnancies.
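The first problem can be demonstrated by evaluating the fitted line at the ends of its range. A short sketch using the coefficients quoted above:

```python
# Sketch: evaluating the fitted line p = -0.01314 + 0.13798*PREG from the text
# shows it produces impossible "proportions" outside [0, 1].
def linear_p(preg):
    return -0.01314 + 0.13798 * preg

assert linear_p(0) < 0   # negative predicted proportion
assert linear_p(8) > 1   # predicted proportion above one
```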
The estimated probability can be greater than 1 or less than 0

• This line rises above 1 from about 7 pregnancies onwards, meaning that more than 100 per cent of all pregnancies at pregnancy 7 or over are aborted. And there are many other cases where the predicted proportions are negative. Apart from the fact that such results are impossible, we might nevertheless be inclined to accept them in the limited range where they are valid.

The linearity assumption is seriously violated

• This would be very dangerous to do, because of the second problem, which is that the assumptions of linear regression are badly violated in this case. This can be seen clearly in the plots obtained with the REGRESSION sub-commands, particularly the final scatterplot of the standardized residuals against the predicted values:

[Residual plot: standardized residuals (-2 to 6) against standardized predicted values (-3 to 3)]

• Recall that this scatter diagram should show no pattern at all, as if a handful of stones were dropped at the centre of the diagram. On the contrary, in this case it could not show a pattern more clearly! This pattern of two lines across the diagram is caused by the fact that the dependent variable can take only two values (1 for having abortion and 0 otherwise); consequently the residuals follow a binomial distribution, not a normal distribution.
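The two-line pattern follows directly from the arithmetic of residuals with a binary outcome. A small sketch, reusing the fitted line quoted earlier:

```python
# Sketch of why the residual plot shows two lines: with a 0/1 outcome, the
# residual at any fitted value p_hat is either 1 - p_hat (when y = 1) or
# -p_hat (when y = 0), so every point lies on one of two parallel lines.
def linear_p(preg):
    # fitted linear line from the text
    return -0.01314 + 0.13798 * preg

for preg, y in [(2, 0), (2, 1), (5, 0), (5, 1)]:
    p_hat = linear_p(preg)
    residual = y - p_hat
    assert residual == (1 - p_hat if y == 1 else -p_hat)
```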
• Because the linearity assumption is broken, the usual hypothesis-testing procedures are invalid.
• R square tends to be very low. The fit of the line is poor because the response can only be 0 or 1, so the values do not cluster around the line.

Logistic Regression

• To get around both problems, we will instead fit a curve of a particular form to the data. This type of curve, known as a logistic curve, has the following general form:

p = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))

where p is the proportion at each value of the explanatory variable X, b0 and b1 are numerical constants to be estimated, and exp is the exponential function.

• This curve has the following form when the parameter b1 is positive, or its mirror image when the parameter is negative.

[Plot: logistic curve, rising from 0 to 1 as X increases]

Logistic Function

• The logistic curve has the property that it never takes values less than zero or greater than one. The way to fit it is to transform the definition of the logistic curve given above into a linear form:

loge(p/(1-p)) = b0 + b1*X

• The function on the left-hand side of this equation has various names, of which the most common are the 'logistic function' and the 'log-odds function'. The log equation has the general form of a linear model.
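The two properties just described can be verified numerically: the logistic curve stays strictly inside (0, 1), and the log-odds transform recovers the linear predictor. A sketch with illustrative constants b0 and b1 (not estimates from the text):

```python
import math

# Sketch: the logistic curve p = exp(b0+b1*X)/(1+exp(b0+b1*X)) stays inside
# (0, 1), and the logit transform log(p/(1-p)) recovers the linear predictor.
# b0 and b1 are illustrative values, not estimates from the text.
b0, b1 = -2.0, 0.5

def logistic(x):
    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))

def logit(p):
    return math.log(p / (1 - p))

for x in range(-10, 11):
    p = logistic(x)
    assert 0 < p < 1                              # never leaves (0, 1)
    assert abs(logit(p) - (b0 + b1 * x)) < 1e-9   # logit is linear in x
```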
Probability and Odds

• A probability is the likelihood that a given event will occur. It is the frequency of a given outcome divided by the total number of all possible outcomes.
• A definition of "odds" is the likelihood of a given event occurring, compared to the likelihood of the same event not occurring.

Probability:  p = e^(b0+b1X) / (1 + e^(b0+b1X))

Odds:  p/(1-p) = e^(b0+b1X)

Taking the natural logarithm of each side of the odds equation yields the following:

ln(p/(1-p)) = b0 + b1X

The above equation has the logit on the left-hand side.

• The logit is a linear function of the X variables.
• The probability is a non-linear function of the X variables.
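The probability and odds scales are interchangeable, which a short sketch makes concrete:

```python
# Sketch of the probability/odds relationship described above: odds = p/(1-p)
# and, inversely, p = odds/(1+odds), so the two scales are interchangeable.
def odds_from_p(p):
    return p / (1 - p)

def p_from_odds(odds):
    return odds / (1 + odds)

print(odds_from_p(0.5))   # 1.0 : "even odds" at probability one half
assert abs(p_from_odds(odds_from_p(0.25)) - 0.25) < 1e-12  # round trip
```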
The Logistic Regression Model

ln(p/(1-p)) = b0 + b1X1 + b2X2 + … + bnXn

where:
• ln is the natural logarithm, loge, where e = 2.71828…
• p is the probability that the event Y occurs, p(Y=1)
• p/(1-p) is the "odds"
• ln[p/(1-p)] is the log odds, or "logit"
• all other components are the same as before

Using Logistic Regression

• We now proceed to a logistic regression with abortion as the dependent variable and pregnancy number as the explanatory variable:

LOGISTIC REGRESSION VAR=abortion
 /METHOD=ENTER preg
 /PRINT=ITER(1)
 /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5)

SPSS Results

• The most fundamental part is the estimation equation derived from the coefficients in the last table of the display:

Variables in the Equation

                    B     S.E.     Wald   df   Sig.   Exp(B)
Step 1  PREG      .691    .031  484.752    1   .000    1.996
        Constant -2.506   .092  741.147    1   .000     .082
a. Variable(s) entered on step 1: PREG.

ln(p/(1-p)) = -2.506 + 0.691*PREG

• For any number of pregnancies (PREG) we can calculate the 'log odds' directly from this equation, for example (the values below were computed before the coefficients were rounded for display):

(pregnancy 2)  -2.506 + 0.691*2 = -1.1229
(pregnancy 4)  -2.506 + 0.691*4 = 0.2597
(pregnancy 6)  -2.506 + 0.691*6 = 1.6424
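These evaluations can be sketched directly. Note that the rounded coefficients reproduce the slide's figures only approximately:

```python
# Sketch: evaluating the fitted equation ln(p/(1-p)) = -2.506 + 0.691*PREG.
# The slide's printed values (-1.1229, 0.2597, 1.6424) come from the unrounded
# coefficients, so the rounded coefficients used here agree only approximately.
b0, b1 = -2.506, 0.691

def log_odds(preg):
    return b0 + b1 * preg

for preg in (2, 4, 6):
    print(preg, round(log_odds(preg), 3))  # ≈ -1.124, 0.258, 1.640
```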
• Because these log odds values have no immediate meaning to most people, it is sometimes helpful to remember that they correspond directly and uniquely with proportions. As a rough guide, the following table shows the correspondence between log odds, odds (the exponential of the log odds) and proportions, for log odds in the range -3 to +3.

Log odds    Odds    Proportion
  -3        0.05      0.05
  -2        0.14      0.12
  -1        0.37      0.27
   0        1.00      0.50
   1        2.72      0.73
   2        7.39      0.88
   3       20.09      0.95

• Notice that odds and proportions are identical to two decimal places for small values (although to more decimal places the odds are always slightly higher).
• In the case of our model, we can estimate the corresponding proportions as follows from the log odds we have already calculated:

(pregnancy 2)  exp(-1.1229)/(1+exp(-1.1229)) = 0.25
(pregnancy 4)  exp(0.2597)/(1+exp(0.2597)) = 0.56
(pregnancy 6)  exp(1.6424)/(1+exp(1.6424)) = 0.84

Goodness of fit

• The output gives us quite a lot of other information, of which the most important is the information about the likelihood ratio χ2 (called in the output '-2 log likelihood' for some reason). We will call this parameter LR χ2, noting that it is exactly the same parameter given by the statistic Chi-square in the CROSSTABS procedure. It provides important information about the goodness of fit of the logistic regression model. We find it in two places in the output:
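The log-odds-to-proportion conversions above can be sketched and checked:

```python
import math

# Sketch: converting the log odds calculated above back to proportions with
# p = exp(log_odds) / (1 + exp(log_odds)), reproducing the slide's values.
def p_from_log_odds(lo):
    return math.exp(lo) / (1 + math.exp(lo))

print(round(p_from_log_odds(-1.1229), 2))  # 0.25  (pregnancy 2)
print(round(p_from_log_odds(0.2597), 2))   # 0.56  (pregnancy 4)
print(round(p_from_log_odds(1.6424), 2))   # 0.84  (pregnancy 6)
```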
• Before the variable PREG is entered into the model (Block 0, constant only), the initial -2 Log Likelihood is 5128.303.

• After PREG is entered into the model (Block 1, Method = Enter):

Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step       615.480    1   .000
         Block      615.480    1   .000
         Model      615.480    1   .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1            4512.823             .142                  .197

• The initial value of LR χ2 (5128.303) is the value when only a constant term is in the model, that is, when b1 is equal to zero. After the variable PREG is included in the model, it reduces to 4512.823, a decrease of 615.480 on one degree of freedom. This decrease is interpreted as a χ2 statistic with one degree of freedom, and it is a highly significant value.
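The decrease and its significance can be sketched with stdlib Python; for one degree of freedom the χ2 tail probability reduces to the complementary error function:

```python
import math

# Sketch: the decrease in -2 log likelihood when PREG enters the model is the
# LR chi-square statistic on 1 degree of freedom; for 1 df its p-value equals
# erfc(sqrt(x/2)).
null_m2ll = 5128.303    # constant-only model
preg_m2ll = 4512.823    # model including PREG

lr_chi2 = null_m2ll - preg_m2ll
print(round(lr_chi2, 3))                     # 615.48

p_value = math.erfc(math.sqrt(lr_chi2 / 2))  # chi-square(1 df) tail probability
assert p_value < 1e-100                      # highly significant
```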
• Notice that the three rows of the omnibus tests table, headed 'Model', 'Block' and 'Step', all have the same content in this example, since the single explanatory variable was entered in a single step. Generally, it is the 'Step' information that is used to examine whether a change to the model at the previous step is worthwhile. It is in this case.
• An R-square, similar to that in linear regression, is reported.

• The output also contains a classification table, showing which cases are classified correctly and incorrectly by their predicted values based on pregnancy number. Note that the model does not do extremely well, getting only 70 per cent correct, including slightly over one-third of those who actually had an abortion. This is to be expected if we try to predict abortion on the basis of pregnancy number and nothing else.

Classification Table(a)

                                    Predicted
                              whether abortion    Percentage
Observed                        no       yes       Correct
Step 1  whether abortion  no   2329      332        87.5
        whether abortion  yes   888      464        34.3
        Overall Percentage                          69.6
a. The cut value is .500

Multivariate Logistic Regression

• Suppose that we want to find out the relationship between educational attainment and the likelihood that a pregnancy is aborted, to test the hypothesis that better-educated women are more likely to abort their pregnancies than uneducated women. (This is one of the effects of 'modernization' in many populations.)

• In logistic regression no particular distinction is made between covariates and other explanatory variables. For SPSS, which cannot ordinarily distinguish between true interval variables and categorical or nominal ones, categorical variables must be specifically identified in the LOGISTIC REGRESSION procedure.
• For the abortion model which we wish to analyse, the dependent variable ABORTION is dichotomous, the variables AGE and PREG are interval variables, and EDUCAT is ordinal. Ordinal variables should be treated as if they were categorical in a logistic regression.
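The classification-table percentages can be recomputed from its cell counts, which makes their meaning explicit:

```python
# Sketch: recomputing the classification-table percentages from its cell counts
# (cut value 0.5). Rows: observed no/yes; columns: predicted no/yes.
counts = {
    ("no", "no"): 2329, ("no", "yes"): 332,   # observed no
    ("yes", "no"): 888, ("yes", "yes"): 464,  # observed yes
}

total = sum(counts.values())                  # 4013 women in all
correct = counts[("no", "no")] + counts[("yes", "yes")]

overall_pct = 100 * correct / total
yes_correct_pct = 100 * counts[("yes", "yes")] / (
    counts[("yes", "no")] + counts[("yes", "yes")]
)

print(round(overall_pct, 1))      # 69.6
print(round(yes_correct_pct, 1))  # 34.3
```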
• Carry out a logistic regression to test the hypothesis that abortion prevalence is higher among more educated women, controlling for age of woman and number of pregnancies:

LOGISTIC REGRESSION VAR=abortion
 /METHOD=ENTER age preg educat
 /CONTRAST (educat)=Indicator(1)
 /PRINT=ITER(1)
 /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5)

• When there is a categorical independent variable, a CONTRAST statement must be specified. By default, SPSS regards the last category of the categorical variable as the base category. In the version given here, we use the first category. If there are two or more categorical variables, a separate CONTRAST statement is required for each categorical variable to achieve the desired result.

• This is quite a satisfactory result, because there is a large reduction in LR χ2, from 5128.303 to 3983.040, a reduction of 1145.263 on 6 degrees of freedom. This is a highly significant result, with p < 0.00···01.

• Null model (logodds(ABORTION) = constant): LR χ2 = 5128.303
• Model including EDUCAT, PREG and AGE: LR χ2 = 3983.040
• Reduction: LR χ2 = 1145.263 (6 degrees of freedom, p < 0.00···01)
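The likelihood-ratio comparison above can be sketched as follows; the breakdown of the 6 degrees of freedom is an assumption (two interval variables plus four indicator variables for a five-category EDUCAT), since the text does not list the EDUCAT categories:

```python
# Sketch: the likelihood-ratio test for the multivariate model. The 6 degrees
# of freedom are assumed to be AGE, PREG, and four indicator (dummy) variables
# for a five-category EDUCAT -- an assumption, not stated in the text.
null_m2ll = 5128.303   # constant-only model
full_m2ll = 3983.040   # model with AGE, PREG and EDUCAT

reduction = null_m2ll - full_m2ll
print(round(reduction, 3))   # 1145.263

df = 2 + 4                   # two interval variables + four EDUCAT dummies (assumed)
assert df == 6
```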