Lecture 13
Poisson Regression Using Stata

This lecture covers
• The Univariate Poisson Distribution
• The Poisson Regression Model
• The Negative Binomial Distribution
• Negative Binomial Regression
• Comparing the Poisson and Negative Binomial Regression Models

Introduction
• This lecture deals with the modeling of dependent variables that are event counts, also called count data. An event count refers to the number of times an event occurs, and is the realization of a nonnegative integer-valued random variable. Variables that count the number of times that something has happened are common in the social sciences.
• Some examples of count variables are the number of times in a year that a person visits the doctor, the number of car accidents that occur each day in a city, the number of love affairs a university student has during the four years of study, the number of industrial injuries in a workplace in a day, the number of cigarettes a person smokes in a day, etc.
• In demography, popular count variables are the number of children born to a woman, the number of pregnancies a woman has, the number of times a person has sexual intercourse in a week, the number of sexual partners in a year, the number of abortions a woman has had in her lifetime, the number of residential migrations a person makes in a lifetime, the number of jobs a migrant worker has held since s/he arrived, etc.
• Frequently, count variables are treated as though they are continuous and unbounded. OLS models are then used to estimate the effects of X variables on their occurrence. OLS is appropriate if the dependent variable, the count, is independently and identically distributed. However, the use of OLS for count outcomes can result in inefficient, inconsistent and biased estimates if one or more of the OLS assumptions are not met.
• Thus, models other than OLS models have been used to handle count data. This lecture will cover two: (1) the Poisson regression model, and (2) the negative binomial regression model. The software used in this lecture is Stata. (Poisson regression is also available in SPSS under the general log-linear model.) Before the discussion of Poisson regression, let's first take a look at the univariate Poisson distribution.

The Univariate Poisson Distribution
• The univariate Poisson distribution provides the benchmark for Poisson regression. Let Y equal a random variable that represents the number of times that an event has occurred during an interval of time. Y will have a Poisson distribution with a parameter μ greater than 0:

$$\Pr(Y = y) = \frac{\exp(-\mu)\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$
• where the parameter μ represents the expected number of counts that have occurred; for the distribution this will also be the mean. Thus,

If y = 0, then $\Pr(Y = 0) = \exp(-\mu)$
If y = 1, then $\Pr(Y = 1) = \exp(-\mu)\,\mu$
If y = 2, then $\Pr(Y = 2) = \exp(-\mu)\,\mu^{2}/2$
If y = 3, then $\Pr(Y = 3) = \exp(-\mu)\,\mu^{3}/6$

Some properties of a Poisson distribution
• As μ increases, the mass of the distribution shifts to the right; we'll see this below in the sample graphs of the univariate Poisson distribution.
• The variance of Y equals μ. The equality of the mean and the variance is known as equidispersion. In practice, many count variables have a variance greater than the mean, which is called overdispersion. Sometimes, therefore, the Poisson regression model is not entirely appropriate, often leading the analyst to the negative binomial regression model (to be discussed later).
• As μ increases, the probability of 0's decreases. In a Poisson distribution, for μ = 0.8, the probability of a 0 is 0.449; for μ = 1.5, the probability of a 0 is 0.223; for μ = 2.9, the probability of a 0 is 0.055; for μ = 10.5, the probability of a 0 is about 0.00003.
• As μ increases, the Poisson distribution approximates a normal distribution.
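The zero probabilities quoted above follow from Pr(Y = 0) = exp(−μ). A minimal Stata sketch for checking them, assuming only Stata's built-in poissonp(m, k) function, which returns the probability of exactly k events when the mean is m:

```stata
* Probability of a zero count for each of the four means used in the lecture
foreach mu in 0.8 1.5 2.9 10.5 {
    display "mu = `mu'   Pr(Y = 0) = " %9.5f poissonp(`mu', 0)
}

* The same quantity computed directly from the formula Pr(Y = 0) = exp(-mu)
display exp(-0.8)
```

Replacing the second argument of poissonp() with 1, 2 or 3 gives the other cases listed on the slide.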
• Here are four examples of univariate Poisson distributions, varying on their values of μ.
• The first Poisson distribution has μ = 0.8. The second, μ = 1.5. The third, μ = 2.9. The fourth, μ = 10.5.
• The Stata commands to produce the graph are as below:

[Stata commands: prcounts is run with the options plot max(20) once per distribution (pya, pyb, ...), and the predicted probabilities (pyapreq, pybpreq, ...) are then graphed against the counts (pyaval).]

[Figure: Four Univariate Poisson Distributions: 0.8, 1.5, 2.9 and 10.5]

• A critical assumption of the Poisson distribution is that when an event occurs, it does not affect the probability of the event occurring in the future. If the "count" is children born to mothers, the assumption of independence implies that when a woman has a baby born to her, it does not affect the probability of another baby being born to her. In demography, however, future fertility is not independent of past fertility, and particularly in China, the next birth (or abortion) is heavily dependent upon the previous ones in the context of the strict family planning policy.
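The four univariate distributions described above (μ = 0.8, 1.5, 2.9 and 10.5) can also be drawn without prcounts. The following is a sketch only, not the lecture's own commands: it uses the built-in poissonp() function, and the variable names (k, p08, p15, p29, p105) are illustrative assumptions. It is written for a do-file, since it uses /// line continuation.

```stata
* Sketch: four univariate Poisson distributions from built-in functions
clear
set obs 21
generate k = _n - 1                    // counts 0 through 20

generate p08  = poissonp(0.8,  k)      // Pr(Y = k) when mu = 0.8
generate p15  = poissonp(1.5,  k)
generate p29  = poissonp(2.9,  k)
generate p105 = poissonp(10.5, k)

label var p08  "mu = 0.8"
label var p15  "mu = 1.5"
label var p29  "mu = 2.9"
label var p105 "mu = 10.5"
label var k    "Count"

twoway connected p08 p15 p29 p105 k, ///
    ytitle("Probability") ///
    title("Four Univariate Poisson Distributions")
```

The plot should show the two properties listed earlier: the mass shifts to the right as μ grows, and the curve for μ = 10.5 is close to the bell shape of a normal distribution.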
The Poisson Regression Model
• In the Poisson regression model, the number of events (the dependent variable) is a nonnegative integer; it has a Poisson distribution with a conditional mean that depends on the characteristics (the X variables) of the individuals according to the following structural model:

$$\mu_i = \exp(a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k)$$

or, equivalently,

$$\ln(\mu_i) = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$

• The Poisson regression model is a nonlinear model, predicting for each individual the number of times, μ, that the event has occurred. The X variables are related to μ nonlinearly.

The Poisson Regression Model Predicting the Number of Children Ever Born to Chinese Women
• We are going to model the number of children ever born (CEB) to Chinese women from the 1997 survey. Before running the Poisson regression, we want to check whether the count data are Poisson distributed.
• Thus we conducted an analysis of the count dependent variable to compare the observed distribution of the count data with a univariate Poisson distribution with the same mean as the count data.
• The dependent variable is a count variable, namely, the number of children ever born to a woman. The variable is called "CEB". Here are descriptive data on this variable for the sample women:
sum ceb, detail

• The "CEB" variable is a count variable, ranging from 0 to 9, with a mean of 1.855, a standard deviation of 1.125, and a variance of 1.266. The mean and the variance are not the same, as they are in a univariate Poisson distribution, but they are close. Unlike the case with many count variables, there is no overdispersion in the "CEB" variable. We will use the "prcounts" command (from the user-written SPost package) to graph the distribution of "CEB" along with a univariate Poisson distribution that has a mean of 1.855. We can then see how closely the data are Poisson distributed.
• First, we estimate a Poisson regression without any independent variables, so as to be able to fit a univariate Poisson distribution with a mean equal to that of our count dependent variable, CEB, namely 1.855.
• When the Poisson regression model has no independent variables, the estimated model is reduced to:

$$\mu = \exp(a)$$

poisson ceb, nolog
• Observe that the Poisson intercept in this model, which has no independent variables, is .6178104. We exponentiate this value, that is, $e^{0.6178104} = 1.854862$, which is indeed the mean of the "CEB" variable.
• Now we use the "prcounts" command to graph the observed distribution of the CEB variable with a univariate Poisson distribution with a mean of 1.855.

Stata command:

prcounts cebprob, plot max(10)
label var cebprobobeq "Observed CEB Distribution"
label var cebprobpreq "Univariate Poisson, mu = 1.855"
label var cebprobval "Number of Children Ever Born"
graph twoway connected cebprobobeq cebprobpreq cebprobval, ytitle("Proportion or Probability") ylabel(0(.1).4) xlabel(0(1)9) title("CEB Distribution and Poisson Distribution with mu = 1.855")

[Figure: CEB Distribution and Poisson Distribution with mu = 1.855. Y axis: Proportion or Probability (0 to .4); X axis: Number of Children Ever Born (0 to 9); series: Observed CEB Distribution; Univariate Poisson, mu = 1.855.]

• The graph shows that the children ever born variable is pretty much Poisson distributed. The univariate Poisson distribution over-predicts the observed CEB distribution at the count of zero, and under-predicts counts of 1 and 2; the remaining counts are pretty close. The observed CEB distribution, compared to the univariate Poisson distribution with a mean, μ, of 1.855, has substantially fewer 0's, and more cases in the earlier counts. Even though the two distributions do not match perfectly, we may conclude that it is reasonable to model the CEB dependent variable with a Poisson model.
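The link between the intercept of the model without independent variables and the mean of CEB can be checked by hand. The sketch below first exponentiates the intercept reported in the lecture, then repeats the idea on simulated data (an assumption for illustration; the lecture's survey data are not used, and the seed and sample size are arbitrary).

```stata
* Exponentiate the intercept reported in the lecture (.6178104)
display exp(.6178104)

* Simulated illustration, not the lecture's data: in an intercept-only
* Poisson model, exp(intercept) equals the sample mean of the outcome
clear
set seed 2024
set obs 1000
generate y = rpoisson(1.855)

quietly summarize y
display "sample mean    = " r(mean)

quietly poisson y, nolog
display "exp(intercept) = " exp(_b[_cons])
```

The two displayed values in the simulated part should agree, which is why the intercept-only model can stand in for the univariate Poisson distribution with the same mean as the data.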
• One reason for the failure of the pure Poisson distribution to perfectly fit the observed CEB data is that the rate of childbearing, i.e., the number of babies produced, μ, differs across the women. The univariate Poisson distribution with a mean of 1.855 does not take into account the heterogeneity of the sample women in their values of μ. So we need to extend the univariate Poisson distribution to the Poisson regression model, in which we assume that the observed CEB count for woman i is drawn from a Poisson distribution with mean μ_i, where μ_i is estimated from observed characteristics, that is, from X variables of the women.

Poisson regression is used to model CEB, with seven X variables, namely:
• woman's age at menarche (menarche), in years;
• age at first marriage (afm), in years;
• woman's exposure to the risk of childbearing (fecund), which is calculated in years for each woman as the difference between her age at menarche and her age at sterilization, her age at menopause, or her age when the survey was conducted, whichever is less;
• number of years of education completed (eduyrs);
• whether the woman lives in an urban area (urban), coded 1 if yes, 0 if no;
• whether the woman is a Han Chinese (han), coded 1 if yes, 0 if no;
• whether the woman's first pregnancy occurred after 1979; this variable is called policy, and is coded 1 if yes, 0 if no.

Here are summary descriptive data for the dependent variable and the seven independent variables:

sum ceb menarche afm fecund eduyrs urban han policy
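The definition of the exposure variable fecund given in the list above can be sketched in Stata as follows. This is an illustration on three made-up records, not the lecture's data, and the names age_steril, age_menop and age_survey are hypothetical; only menarche and fecund are names used in the lecture.

```stata
* Toy records: a missing value (.) means the event has not occurred
clear
input menarche age_steril age_menop age_survey
14 32  . 40
13  .  . 29
15  . 48 52
end

* min() ignores missing arguments, so the earliest age that actually
* applies to each woman is used as the end of her exposure
generate fecund = min(age_steril, age_menop, age_survey) - menarche
list
```

For the three toy records, exposure ends at sterilization, at the survey, and at menopause respectively.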
• We hypothesize that the older the woman at first menstruation, the greater the number of CEB to her; the greater the number of years of education, the lower the CEB; if the first pregnancy occurred after 1979 (policy = 1), CEB will be lower; urban women will have fewer CEB than rural women.
• We will now estimate a Poisson regression model, predicting CEB with the above seven X variables. The Stata command is poisson followed by the dependent variable and then the seven X variables.

poisson ceb menarche afm fecund eduyrs urban han policy

• We may ask how well this Poisson regression model of children ever born improves our ability to predict the probabilities of a woman having each number (i.e., each count) of children.
• We use the "prcounts" command to calculate the predicted probabilities for each count of CEB. We will call these predicted probabilities "prceb".

Stata command:

poisson ceb menarche afm fecund eduyrs urban han policy
prcounts prceb, plot max(10)
label var prcebpreq "Prediction from PRM"
graph twoway connected cebprobobeq cebprobpreq prcebpreq cebprobval, ytitle("Proportion or Probability") ylabel(0(.1).4) xlabel(0(1)9) title("Distributions of CEB, Univariate Poisson") sub("and Poisson Regression Model")
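To make concrete what "predicted probabilities for each count" means after a Poisson regression, here is a minimal sketch using only built-in Stata commands. It assumes that the plotted prediction for count k is the average, over all observations, of each observation's own Poisson probability of k given its predicted mean. The data are simulated and the names x, y, mu and p0 to p9 are illustrative; none of this is the lecture's data or the prcounts code itself.

```stata
* Simulated data: y is Poisson with a mean that depends on x
clear
set seed 2024
set obs 500
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))

quietly poisson y x, nolog
predict mu, n                      // predicted mean count per observation

* Probability of each count 0..9 for every observation
forvalues k = 0/9 {
    generate p`k' = poissonp(mu, `k')
}

* The column of means gives the average predicted probability per count
summarize p0-p9
```

Because each observation has its own μ, these averaged probabilities generally differ from those of a single univariate Poisson distribution with the overall mean.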
[Figure: Distributions of CEB, Univariate Poisson and Poisson Regression Model. Y axis: Proportion or Probability (0 to .4); X axis: Number of Children Ever Born (0 to 9); series: Observed CEB Distribution; Univariate Poisson, mu = 1.855; Prediction from PRM.]

• The predicted probabilities generated by the Poisson regression model (PRM) are only slightly worse at predicting count 0 than the predictions generated by the univariate Poisson distribution; but for the most part the PRM results in only a modest improvement in the predictions. Both the PRM predictions and the predictions generated by the univariate Poisson model are still somewhat off the actual distribution of CEB.

Poisson Goodness of Fit
• There is a formal "goodness of fit" test we may calculate that compares the observed empirical distribution with the distribution predicted by the Poisson regression model. The null hypothesis (H0) is that there is no difference between the observed data and the modeled data, indicating that the model fits the data. So we are looking for a small value of chi-square, with a probability > 0.05. If we have a small chi-square, this would mean we have a model that fits the data.
• The Stata command is poisgof
• The goodness of fit test is good news for our model. It tells us that the model fits the data very well; specifically, the goodness of fit Chi2 test indicates that, given the Poisson regression model, we cannot reject the null hypothesis that our observed data are Poisson distributed.
• The Stata printout for the Poisson regression equation also shows values of Pseudo R2 and the likelihood ratio (LR) Chi2; the latter indicates that we have a statistically significant model.
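In more recent Stata releases the goodness-of-fit test described above is run with the postestimation command estat gof, which reports the deviance and Pearson chi-square statistics with their p-values. A minimal sketch on simulated data follows (not the lecture's data; the variable names, the seed, and the numbers passed to chi2tail() are illustrative assumptions).

```stata
* Simulated data in which y really is Poisson given x
clear
set seed 2024
set obs 1000
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))

quietly poisson y x

* Deviance and Pearson goodness-of-fit statistics with p-values;
* a p-value above 0.05 means the Poisson model is not rejected
estat gof

* The p-value is the upper tail of a chi-square distribution:
* chi2tail(degrees of freedom, statistic); hypothetical numbers here
display chi2tail(990, 1000)
```

Since the simulated outcome is Poisson by construction, the test would usually fail to reject the null hypothesis, which is the same reading the lecture gives for the CEB model.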