Journal of Health Economics 21 (2002) 601–625

The structure of demand for health care: latent class versus two-part models

Partha Deb a,∗, Pravin K. Trivedi b

a Department of Economics, IUPUI, Cavanaugh Hall 516, 425 University Boulevard, Indianapolis, IN 46202, USA
b Department of Economics, Indiana University, Wylie Hall, Bloomington, IN 47405, USA

Received 1 November 2000; accepted 1 January 2002

Abstract

We contrast the two-part model (TPM) that distinguishes between users and non-users of health care, with a latent class model (LCM) that distinguishes between infrequent and frequent users. In model comparisons using data on counts of utilization from the RAND Health Insurance Experiment (RHIE), we find strong evidence in favor of the LCM. We show that individuals in the infrequent and frequent user latent classes may be described as being healthy and ill, respectively. Although sample averages of price elasticities, conditional means and event probabilities are not statistically different, the estimates of these policy-relevant measures are substantively different when calculated for hypothetical individuals with specific characteristics.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Latent class model; Finite mixture model; Two-part model; Count data

∗ Corresponding author. Tel.: +1-317-274-5216; fax: +1-317-274-0097. E-mail address: pdeb@iupui.edu (P. Deb).

1. Introduction

This paper examines empirical strategies for modeling the demand for health services, measured as counts of utilization. The choice of the econometric framework has implications for a number of empirical issues of central importance in health economics, e.g. the price sensitivity of the demand for medical services, predicted use and the likelihood of being extensive users of services. The paper proposes an approach based on a finite mixture variant of the latent class model (LCM). The proposed approach is compared with the “standard” two-part framework for modeling the demand for health care.

The literature on the demand for medical care analyzes either discrete measures, such as the number of physician or non-physician visits (Cameron et al., 1988; Pohlmeier and Ulrich, 1995; Deb and Trivedi, 1997; Gerdtham, 1997), or continuous measures such as
expenditures (Duan et al., 1983; Manning et al., 1987; Keeler et al., 1988; McCall et al., 1991). In modeling the usage of medical services, the two-part model (TPM) has served as a methodological cornerstone of empirical analysis. The first part of the TPM is a binary outcome model that describes the distinction between non-users and users. The second part describes the distribution of use conditional on some use, modeled either as a continuous or integer-valued random variable. Although in health economics the TPM is used predominantly to refer to models of health expenditures, the structure of the TPM is equally applicable for discrete or continuous outcomes. The TPM for count data is often referred to as a hurdle model.

The appeal of the TPM is partly driven by an important feature of the demand for medical care, which is the high incidence of zero usage. For example, approximately 30% of typical cross-sectional samples of non-institutionalized individuals in the US report no outpatient visits in the survey year. However, the TPM is well supported empirically, with explanatory variables often playing different roles in the two parts of the model. The appeal of the TPM in health economics is also based on its connection to a principal-agent model (see, for example, Zweifel, 1981) where the physician (the agent) determines utilization on behalf of the patient (the principal) once initial contact is made. The following quotes highlight the strength of this argument in the literature:

... the decision to receive some care is largely the consumer’s, while the physician influences the decision about the level of care (Manning et al., 1987, p. 109).

... while at the first stage it is the patient who determines whether to visit the physician (contact analysis), it is essentially up to the physician to determine the intensity of the treatment (frequency analysis) (Pohlmeier and Ulrich, 1995, p. 340).

... where the first part relates to the patient who decides whether to contact the physician (contact decision) and the second to the decision about repeated visits and/or referrals, which is determined largely by the preferences of the physician (frequency decision) (Gerdtham, 1997, p. 308).

This sharp dichotomy between users and non-users may be appealing in modeling data on episodes of medical care but this distinction may not be tenable in the case of typical cross-sectional datasets. In these data, health care events are recorded over a fixed time period (e.g. a year or a month) and not over an episode of illness. More generally, the first part of the TPM may be thought of as modelling the decision to initiate the first episode of treatment, while the second part is a combination of the patient’s decisions to initiate subsequent treatment and the physicians’ decisions about the intensity of each of those episodes. Unless one believes that the initiation of the first episode of care during a fixed time period has special characteristics (relative to initiation of subsequent episodes), the appeal of the TPM may, in principle, be diminished.

A more tenable distinction for typical cross-sectional data may be between an “infrequent user” and a “frequent user” of medical care, the difference being determined by health status, attitudes to health risk, and choice of life-style. The LCM, in which there is no distinction between users and non-users of care, but which can distinguish between groups with high average demand and low average demand, therefore provides a better framework.
We hypothesize that the underlying unobserved heterogeneity which splits the population into latent classes is based on an individual’s latent long-term health status. Proxy variables such as self-perceived health status and chronic health conditions may not fully capture population heterogeneity from this source. Consequently, in the case of two latent subpopulations, a distinction may be made between the “healthy” and the “ill” groups, whose demands for medical care are characterized by low mean and high mean, respectively.

From a statistical point of view, the TPM is also a finite mixture with a degenerate component. It combines zeros from a binomial density with the positives from a zero-truncated density. The LCM is more flexible because it permits mixing with respect to both zeros and positives. While the TPM and LCM are clearly related, they are not nested. Hence it is not a priori clear which model would perform better empirically. In a study of medical care demand by the elderly, Deb and Trivedi (1997) find that the LCM is superior to the TPM. In other empirical work, Cameron and Trivedi (1998) show that the TPM describes the number of recreational trips taken by individuals better than the LCM.

A careful comparison of the LCM and TPM is useful from a policy perspective. The TPM has been used extensively to estimate demand responses to prices, income and changes in insurance status. The results have been used to propose changes in health insurance design. Statistics of interest in many such policy exercises are non-linear functions of the underlying parameters of the conditional mean function. Therefore, consistent estimation of the conditional mean function does not ensure consistent estimates of the statistics of interest for policy exercises; see Mullahy (1998) for a detailed discussion. Estimating a model that fits the empirical distribution adequately does, on the other hand, ensure that such statistics will be estimated consistently. Moreover, if in fact the TPM is dominated by the LCM, the accumulated evidence in favor of the TPM, interpreted as evidence in favor of a principal-agent framework, is ambiguous. Policies based on the principal-agent framework might, therefore, have unintended consequences.

Finally, both the TPM and the LCM require that the investigator specify the probability distribution of the data. Although this is a potential source of misspecification in both cases, its impact is smaller in the case of the LCM. This is because the LCM is more flexible and can serve as a better approximation to any true, but unknown, probability density (Laird, 1978; Heckman and Singer, 1984). Its growing popularity is reflected in an increase in the number of regression-based applications in econometrics. Recent applications include Heckman et al. (1990), Gritz (1993), Wedel et al. (1993), Deb and Trivedi (1997), Geweke and Keane (1997), Morduch and Stern (1997), and Wang et al. (1998).

We use data from the RAND Health Insurance Experiment (RHIE). The RHIE is one of the largest social experiments ever completed, generating over 400 research studies by members of the RAND group (Newhouse et al., 1993). It is widely regarded as the basis of the most reliable estimates of price sensitivity of demand for medical services. For example, Burtless (1995, p. 82) has stated: “The Health Insurance Experiment improved our knowledge about the price sensitivity of demand for medical services in a way that no non-experimental study has been able to match”. Therefore, the public-use data from the RHIE provide a suitable test-bed for our proposed investigations. We examine two measures of counts of utilization. The covariates are among those commonly used in studies of health care demand.
The RHIE data are most suitable for our work in spite of the fact that they are considerably older than other nationally representative surveys like the National Medical Expenditure Survey of 1987, the National Health Interview Survey of 1994 or the Medical Expenditure Panel Survey of 1997. First, the RHIE dataset is the only one in which individuals were randomized into insurance plans, thus making insurance choice exogenous. Endogeneity of insurance choice is a major problem in non-experimental data; even in cases where suitable instruments exist, they are typically weak, thus making statistical corrections for endogeneity unreliable. Second, RAND researchers gave careful consideration to issues of attrition bias and other sources of “sample contamination” which affect some social experiments (Newhouse et al., 1993, chapter 2; Heckman and Smith, 1995).

In the following section of the paper, we formally present the competing models used in this paper and discuss model comparison, selection, and evaluation strategies. The data are described in Section 3. Empirical results are reported in Section 4, and we conclude in Section 5.

2. Econometric models

We develop models for counts of outpatient visits using the LCM and TPM frameworks. Both are derived from the negative binomial model (NBM) for count data, so we begin by describing that model.

2.1. NBM

Let y_i be a count dependent variable that takes values 0, 1, 2, ... The density function for the NBM is given by

f(y_i \mid \theta) = \frac{\Gamma(y_i + \psi_i)}{\Gamma(\psi_i)\,\Gamma(y_i + 1)} \left( \frac{\psi_i}{\lambda_i + \psi_i} \right)^{\psi_i} \left( \frac{\lambda_i}{\lambda_i + \psi_i} \right)^{y_i}   (2.1)

where Γ(·) is the gamma function, λ_i = exp(x_i′β) and the precision parameter ψ_i is specified as ψ_i = (1/α)λ_i^k. The parameter α > 0 is an overdispersion parameter and k is an arbitrary constant. In this specification, the conditional mean is given by

E(y_i \mid x_i) = \lambda_i   (2.2)

and the variance by

V(y_i \mid x_i) = \lambda_i + \alpha \lambda_i^{2-k}.   (2.3)

The parameter k is usually held fixed in empirical work. The NB1 model is obtained by specifying k = 1 while the NB2 is obtained by setting k = 0.
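As a concrete illustration of this parameterization, the following sketch evaluates the NB log-likelihood of Eq. (2.1). It is our own illustrative Python code, not the paper's implementation (which was written in SAS/IML), and the function and argument names are hypothetical; setting k = 1 gives NB1 and k = 0 gives NB2.

```python
import numpy as np
from scipy.special import gammaln

def nb_loglik(beta, alpha, k, y, X):
    """NB log-likelihood of Eq. (2.1), with lam = exp(X @ beta)
    and psi = (1/alpha) * lam**k; k = 1 gives NB1, k = 0 gives NB2."""
    lam = np.exp(X @ beta)
    psi = (1.0 / alpha) * lam**k
    ll = (gammaln(y + psi) - gammaln(psi) - gammaln(y + 1.0)
          + psi * np.log(psi / (lam + psi))
          + y * np.log(lam / (lam + psi)))
    return ll.sum()
```

Maximizing this over (β, α) for fixed k yields the one-component NBM that both the TPM and the LCM below generalize.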
2.2. TPM

We choose a NB density to construct the TPM because we wish to focus on the differences between a statistical structure that distinguishes infrequent and frequent users (LCM) from one that distinguishes non-users and users (TPM) while minimizing all other sources of variation. From the NB density shown in Eq. (2.1), one can derive the probability of being a non-user as

\Pr_1(y_i = 0 \mid x_i, \theta_1) = \left( \frac{\psi_{1,i}}{\lambda_{1,i} + \psi_{1,i}} \right)^{\psi_{1,i}},   (2.4)

where the subscript 1 denotes parameters associated with the first part of the TPM, λ_{1,i} = exp(x_i′β_1) and ψ_{1,i} = (1/α_1)λ_{1,i}^k. The probability of being a user is calculated as (1 − Pr_1(y_i = 0 | x_i, θ_1)). The first part involves only binary information so the parameters (β_1) of the mean function and the parameter α_1 are not separately identifiable. We set α_1 = 1 without loss of generality.

In the second part of the TPM, the distribution of utilization conditional on some use is assumed to follow a truncated NB distribution. After some algebraic manipulation, one gets

f_2(y_i \mid x_i, y_i > 0, \theta_2) = \frac{\Gamma(y_i + \psi_{2,i})}{\Gamma(\psi_{2,i})\,\Gamma(y_i + 1)} \left[ \left( \frac{\lambda_{2,i} + \psi_{2,i}}{\psi_{2,i}} \right)^{\psi_{2,i}} - 1 \right]^{-1} \left( \frac{\lambda_{2,i}}{\lambda_{2,i} + \psi_{2,i}} \right)^{y_i}   (2.5)

as the conditional density of use.1 Note that, although the first and second parts are derived from the NB density, the parameters are allowed to be different.

The first and second parts of the TPM enter multiplicatively in the likelihood function. Therefore, the likelihood function associated with the binary choice can be maximized separately from the second part, which is estimated using the truncated subsample of positive observations of y_i. The mean of the count variable in this TPM is given by

E(y_i \mid x_i) = \frac{\Pr_1(y_i > 0 \mid x_i, \theta_1)}{\Pr_2(y_i > 0 \mid x_i, \theta_2)}\, \lambda_{2,i}   (2.6)

and the variance by

V(y_i \mid x_i) = \frac{\Pr_1(y_i > 0 \mid x_i, \theta_1)}{\Pr_2(y_i > 0 \mid x_i, \theta_2)} \left\{ \lambda_{2,i} + \alpha_2 \lambda_{2,i}^{2-k} + \left[ 1 - \frac{\Pr_1(y_i > 0 \mid x_i, \theta_1)}{\Pr_2(y_i > 0 \mid x_i, \theta_2)} \right] \lambda_{2,i}^2 \right\}.   (2.7)

Both the mean and the variance in the TPM are, in general, different from their standard NB counterparts. The TPM can accommodate over and underdispersed data relative to the NBM.

1 Although we have chosen to derive both parts of the hurdle model from parent NB distributions, we recognize that users may sometimes choose to estimate the binary choice part using more familiar logit or probit models. This choice is typically not significant because, as is commonly known, the exact choice of distribution in binary choice models makes very little difference to the estimated probabilities. In our case, we have also estimated logit models with almost identical results.
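To make the two-part structure concrete, here is a minimal sketch of the TPM log-likelihood under the NB parameterization of Eqs. (2.4) and (2.5); the code and its names are our own illustration, not the authors' implementation, and α_1 is normalized to 1 as in the text.

```python
import numpy as np
from scipy.special import gammaln

def tpm_loglik(beta1, beta2, alpha2, k, y, X):
    """Hurdle/TPM log-likelihood. Part 1: NB-derived Pr(y = 0) of
    Eq. (2.4) with alpha1 = 1. Part 2: zero-truncated NB, Eq. (2.5)."""
    lam1 = np.exp(X @ beta1)
    psi1 = lam1**k                                # alpha1 = 1 normalization
    logp0 = psi1 * np.log(psi1 / (lam1 + psi1))   # log Pr(y = 0), Eq. (2.4)
    # binary part: zeros contribute log p0, positives log(1 - p0)
    ll1 = np.where(y == 0, logp0, np.log1p(-np.exp(logp0)))
    # truncated NB part, positive counts only
    pos = y > 0
    lam2 = np.exp(X[pos] @ beta2)
    psi2 = (1.0 / alpha2) * lam2**k
    yp = y[pos]
    logf0 = psi2 * np.log(psi2 / (lam2 + psi2))   # untruncated log Pr(y = 0)
    ll2 = (gammaln(yp + psi2) - gammaln(psi2) - gammaln(yp + 1.0)
           + logf0 + yp * np.log(lam2 / (lam2 + psi2))
           - np.log1p(-np.exp(logf0)))            # divide by 1 - Pr(y = 0)
    return ll1.sum() + ll2.sum()
```

Because β_1 and (β_2, α_2) do not appear in each other's terms, the two sums can be maximized separately, which is what makes the TPM computationally simple.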
2.3. LCM

In the LCM, the random variable is postulated as a draw from a population which is an additive mixture of C distinct subpopulations in proportions π_1, ..., π_C, where Σ_{j=1}^{C} π_j = 1, π_j ≥ 0 (j = 1, ..., C). The mixture density for observation i, i = 1, ..., n, is given by

f(y_i \mid \theta) = \sum_{j=1}^{C-1} \pi_j f_j(y_i \mid \theta_j) + \pi_C f_C(y_i \mid \theta_C), \quad i = 1, \ldots, n,   (2.8)

where each term in the sum on the right-hand side is the product of the mixing probability π_j and the component (subpopulation) density f_j(y_i | θ_j). The π_j are unknown constants that are estimated along with all other parameters, denoted θ.2 Also π_C = (1 − Σ_{j=1}^{C−1} π_j). For identification (normalization), we use the labelling restriction that π_1 ≥ π_2 ≥ ··· ≥ π_C, which can always be satisfied by rearrangement post estimation.

The component densities of the C-component finite mixture are specified as

f_j(y_i \mid \theta_j) = \frac{\Gamma(y_i + \psi_{j,i})}{\Gamma(\psi_{j,i})\,\Gamma(y_i + 1)} \left( \frac{\psi_{j,i}}{\lambda_{j,i} + \psi_{j,i}} \right)^{\psi_{j,i}} \left( \frac{\lambda_{j,i}}{\lambda_{j,i} + \psi_{j,i}} \right)^{y_i},   (2.9)

where j = 1, 2, ..., C are the latent classes, λ_{j,i} = exp(x_i′β_j) and ψ_{j,i} = (1/α_j)λ_{j,i}^k. Note that (β_j, α_j) are unrestricted across components.

The conditional mean of the count variable is given by

E(y_i \mid x_i) = \bar{\lambda}_i = \sum_{j=1}^{C} \pi_j \lambda_{j,i}   (2.10)

and the variance by

V(y_i \mid x_i) = \sum_{j=1}^{C} \pi_j \lambda_{j,i}^2 \left[ 1 + \alpha_j \lambda_{j,i}^{-k} \right] + \bar{\lambda}_i - \bar{\lambda}_i^2.   (2.11)

Both the mean and the variance in the LCM are, in general, different from their standard NB counterparts. The LCM can also accommodate over and underdispersed data relative to the NBM, but does so in a different manner than the TPM. Because this density is very flexible (permitting, for example, multimodal marginal distributions) and easily captures long right tails, it is likely to accommodate patterns of overdispersion expected in our data.

2 In general, π_j may be parameterized as a function of covariates. However, such models are often fraught with identification problems if separating information is not available. If separating information is available, on the other hand, identification is feasible (Duan et al., 1983).
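A minimal sketch of the mixture log-likelihood in Eq. (2.8), again our own illustrative code with hypothetical names; logsumexp guards against underflow when a component density is tiny.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def lcm_loglik(components, pi, k, y, X):
    """Finite-mixture NB log-likelihood, Eq. (2.8).
    components: list of (beta_j, alpha_j), one per latent class;
    pi: mixing probabilities summing to one."""
    log_terms = []
    for (beta_j, alpha_j), pi_j in zip(components, pi):
        lam = np.exp(X @ beta_j)
        psi = (1.0 / alpha_j) * lam**k
        log_f = (gammaln(y + psi) - gammaln(psi) - gammaln(y + 1.0)
                 + psi * np.log(psi / (lam + psi))
                 + y * np.log(lam / (lam + psi)))   # component density, Eq. (2.9)
        log_terms.append(np.log(pi_j) + log_f)
    # log of the mixture: logsumexp over classes, observation by observation
    return logsumexp(np.vstack(log_terms), axis=0).sum()
```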
2.4. Properties of LCM

LCMs offer a flexible way of specifying mixtures of densities. There are a number of advantages of using a discrete rather than a continuous mixing distribution. First, the finite mixture representation provides a natural and intuitively attractive representation of heterogeneity in a finite, usually small, number of latent classes, each of which may be regarded as a “type”, or a “group”. Second, the finite mixture approach is semiparametric: it does not require any distributional assumptions for the mixing variable. The approach is an alternative to either nonparametric estimation or forcing the data through the straitjacket of a one-component parametric density. Third, the results of Laird (1978) and Heckman and Singer (1984) suggest that estimates of such finite mixture models may provide good numerical approximations even if the underlying mixing distribution is continuous. The structure of the moments given above shows how the mixture model “decomposes” the information contained in a one-component model.3 Finally, the choice of a continuous mixing density for some parametric count models is sometimes restrictive and computationally intractable because the marginal density may not have an analytical solution.

Note that in the NB model, the response of E(y) to a covariate x is fixed by exp(xβ). If observations in the right tail have a different response to changes in x, the NB model could not capture that effect. The TPM loosens the parametric straitjacket by allowing different parameters for Pr(y = 0) and E(y|y > 0). However, the TPM is not likely to capture differential responses to changes in x in the right tail of the distribution because the response of E(y|y > 0) to a covariate x is fixed by exp(xβ). In the LCM, the response of E(y) to a covariate x is determined by two or more sets of interactions between parameters and covariates (depending on the number of components), and is therefore more likely to accommodate differential responsiveness.4

A finite mixture characterization is especially attractive if the mixture components have a natural interpretation. However, this is not essential. A finite mixture may be simply a way of flexibly and parsimoniously modeling the data, with each mixture component providing a local approximation to some part of the true distribution. A caveat to the foregoing discussion is that the LCM may fit the data better simply because outliers, influential observations or contaminated observations are present in the data. The LCM will capture this phenomenon through additional mixture components. Hence it is desirable that the hypothesis of the LCM should be supported both by a priori reasoning and by meaningful a posteriori differences in the behavior of latent classes.

3 Lindsay (1995) provides a detailed theoretical analysis; Haughton (1997) surveys computational issues and available software.
4 We thank an anonymous referee for suggesting this intuition.
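One way to make this intuition precise (our own gloss on the argument, using Eq. (2.10)): the NB conditional mean responds to a covariate through a single coefficient, whereas the LCM mean is a mixture whose class weights themselves depend on x.

```latex
% NBM: a single proportional response to covariate x_m
\frac{\partial E(y \mid x)}{\partial x_m} = \beta_m \exp(x'\beta)
% LCM, from Eq. (2.10): a weighted combination of class-specific responses
\frac{\partial E(y \mid x)}{\partial x_m}
  = \sum_{j=1}^{C} \pi_j \beta_{j,m} \lambda_j(x), \qquad
\frac{\partial \log E(y \mid x)}{\partial x_m}
  = \sum_{j=1}^{C} w_j(x)\,\beta_{j,m}, \quad
w_j(x) = \frac{\pi_j \lambda_j(x)}{\sum_{l=1}^{C} \pi_l \lambda_l(x)}
```

Because the weights w_j(x) shift toward the high-mean class in the right tail, the semi-elasticity there is dominated by that class's coefficients, which is the differential responsiveness a single exp(xβ) mean cannot produce.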
2.5. Maximum likelihood and cluster-robust standard errors

Both TPM and LCM are estimated using (pseudo) maximum likelihood. The standard TPM is computationally simple because the two parts of the likelihood function can be estimated separately. On the other hand, estimation of the LCM is not straightforward. A comprehensive discussion of maximum likelihood estimation of the LCM can be found in McLachlan and Peel (2000). The likelihood functions of finite mixture models can have multiple local maxima so it is important to ensure that the algorithm converges to the global maximum. In general, random perturbation or grid search techniques, or algorithms such as simulated annealing (Goffe et al., 1994), designed to seek the global optimum, may be utilized. In this study, to ensure against the possibility of achieving (false) convergence at a local maximum, each model was estimated using a number of different sets of starting values. No problems of convergence to local maxima were observed. Moreover, if a model with too many points of support is chosen, one or more points of support may be degenerate, i.e. the πs associated with those densities may be zero. In such cases, the solution to the maximum likelihood problem lies on the boundary of the parameter space. This can cause estimation algorithms to fail, especially if unconstrained maximization algorithms are used. Constrained maximization algorithms are preferred. We estimate LCM models by maximum likelihood using the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton constrained maximization algorithm in SAS/IML (SAS Institute, 1997).

Although the standard sandwich variance formula (Cameron and Trivedi, 1998, p. 31) is robust to certain types of misspecification, it does not account for cluster effects in the sample. Because the RHIE data are clustered by construction, adjustments for such clustering are desirable. To do so, the sandwich variance matrix formula is extended to the case of clustered observations:

V(\hat{\theta}) = \left( \sum_{i=1}^{N} \frac{\partial^2 \log f_i}{\partial \theta\, \partial \theta'} \right)^{-1} \left[ \sum_{g=1}^{G} \left( \sum_{i=1}^{N_g} \frac{\partial \log f_{ig}}{\partial \theta} \right) \left( \sum_{i=1}^{N_g} \frac{\partial \log f_{ig}}{\partial \theta} \right)' \right] \left( \sum_{i=1}^{N} \frac{\partial^2 \log f_i}{\partial \theta\, \partial \theta'} \right)^{-1}

where f_i = f(y_i | x_i, θ) is the data density function, N_g denotes the number of elements in group (cluster) g, G is the number of groups (clusters), and θ denotes the vector of all unknown parameters. This approach has the advantage that no specific assumption about the form of intra-cluster dependence is necessary. Intuitively, one averages over the likelihood scores within each cluster instead of treating cluster elements as independent. In all goodness of fit (GoF) and hypothesis test statistics we use this “cluster-robust” variance in place of the “standard” formula.
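In code, the cluster adjustment amounts to summing the score rows within each cluster before forming the outer-product "meat" of the sandwich. A minimal sketch (our own names and shapes, assuming the Hessian and per-observation scores of the fitted model are available, e.g. by numerical differentiation of the log-likelihoods above):

```python
import numpy as np

def cluster_robust_vcov(hessian, scores, cluster_ids):
    """Cluster-robust sandwich variance of Section 2.5:
    V = H^{-1} [ sum_g s_g s_g' ] H^{-1}, with s_g the within-cluster
    sum of score rows. hessian: (K, K) sum of second derivatives of
    the log-likelihood; scores: (N, K); cluster_ids: (N,) array."""
    H_inv = np.linalg.inv(hessian)
    K = scores.shape[1]
    meat = np.zeros((K, K))
    for g in np.unique(cluster_ids):
        s_g = scores[cluster_ids == g].sum(axis=0)   # summed scores in cluster g
        meat += np.outer(s_g, s_g)
    return H_inv @ meat @ H_inv
```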
2.6. Model comparison and selection

NBM is nested within TPM and LCM, but TPM and LCM are non-nested. Therefore, we use three criteria to compare models in-sample, each of which is designed for the comparison of non-nested models. These include two traditional model selection criteria based on penalized likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which are valid even in the presence of model misspecification (Sin and White, 1996). Models with smaller values of the AIC = −ln L + 2K and BIC = −2 ln L + K ln(N), where ln L is the maximized log likelihood, K is the number of parameters in the model and N is the sample size, are preferred. We also use Andrews’ GoF test (Andrews, 1988a,b). The GoF test is based on a χ2 diagnostic statistic given by S = (f − f̂)′ Σ̂^{−1} (f − f̂), where f − f̂ is the N × q matrix of differences between sample and fitted cell frequencies, q is the number of cells created by partitioning of the data, and Σ̂ is its estimated covariance matrix. Under the null hypothesis of no misspecification the test has an asymptotic χ2(q − 1) distribution (Andrews, 1988b). When the test statistic is formed using the maximum likelihood estimator, computation of the test statistic is simplified. Let A be the N × q matrix with ith row given by f_i − f̂_i, let B be the N × K matrix with ith row given by (∂/∂θ) log f_i(y_i | θ), and let H = [A B]. Then

\text{GoF} = \mathbf{1}' H (H'H)^{+} H' \mathbf{1}   (2.12)

where 1 is a column vector of ones, i.e. GoF is NR^2 from the regression of 1 on H (see Andrews, 1988a, Appendix 5, for details). Our implementation of the GoF adjusts for cluster effects by first summing the elements of H within clusters.

There is an important point of detail concerning our use of the above test. Because our sample size is quite large, a model comparison based on significance tests of fixed size is likely to lead to the rejection of all models we will consider. This is a well-known difficulty of hypothesis testing with fixed significance levels in the classical framework. Moreover, previous investigations of the properties of this test suggest that, at conventional critical values, it leads to overrejection of the true null (Deb and Trivedi, 1997; Cameron and Trivedi, 1998). However, it seems appropriate to rank models by the P-values associated with the test, the model with the largest P-value being preferred. Hence, in addition to the formal χ2-test, we use the size of the statistic informally as a measure of fit, with a smaller statistic indicating better fit. Furthermore, we also graphically compare empirical and fitted cell probabilities.

2.7. Cross-validation

A common criticism of in-sample model selection methods is that they induce over-fitting in the case of complicated models. Consequently, the selected model may not be the best model. This bias can be avoided by treating one sample as a “training sample” used for estimation, and then using a second “hold-out” sample for forecast comparison.

Using parameter estimates from the training sample, we calculate three measures of performance for each model using the hold-out sample. The log-likelihood value is the most direct measure of the out-of-sample fit of the model. In order to continue to penalize models with large numbers of parameters, we also use the AIC. We do not use the BIC because it adds a penalty for the sample size in addition to a penalty for the number of parameters, which is not appropriate in a cross-validation exercise. Finally, we use a modified version of the Andrews statistic as a heuristic, with the expectation that models with better fit will have smaller values of the modified GoF in the hold-out sample. We modify Eq. (2.12) to MGoF = 1′A(A′A)^{+}A′1, where H is replaced by A; i.e. we assume that the parameters exactly maximized the likelihood function in the training sample.
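A sketch of the hold-out computation (illustrative only; theta_hat packs whatever parameters the chosen model's log-likelihood expects):

```python
def holdout_measures(loglik_fn, theta_hat, n_params, y_hold, X_hold):
    """Section 2.7 measures: hold-out log-likelihood evaluated at
    training-sample estimates, and the AIC with the same penalty
    convention as the text (-ln L + 2K).
    Example: loglik_fn = lambda th, y, X: nb_loglik(th[:-1], th[-1], 1, y, X)"""
    ll = loglik_fn(theta_hat, y_hold, X_hold)
    aic = -ll + 2 * n_params
    return ll, aic
```

The modified GoF statistic then uses only the cell-frequency block A, dropping the score block B, since the score moment conditions hold exactly only at the training-sample optimum.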
3. Data and summary statistics

We use data from the RHIE for this study. The experiment, conducted by the RAND Corporation from 1974 to 1982, is the longest and largest controlled social experiment in medical care research. The main goal of the experiment was to assess how a patient’s use of health services is affected by types of health insurance, including both fee-for-service and health maintenance organizations (HMOs). In the RHIE, data were collected from about 8000 enrollees in 2823 families, from six sites across the country. Each family was enrolled in one of fourteen different HIE insurance plans for either 3 or 5 years. The plans ranged from free care, to 95% coinsurance below a maximum dollar expenditure (MDE), to assignment in a prepaid group practice. Data were collected on the enrollee’s use of medical care services and health status throughout the term of enrollment, randomly assigned for 3 or 5 years. Certain categories of individuals were excluded, e.g. the Medicare eligibles and members of the military. Detailed information on the experimental design and data collection methods is reported in Morris (1979), Taylor et al. (1987), and Newhouse et al. (1993).

The sample used in this study consists of individuals in the fee-for-service plans only. Will Manning kindly provided us with a data file consisting of the identifiers for the observations used in Manning et al. (1987) along with utilization and a number of covariates. These data were merged with the public-use RHIE files available from the Inter-University Consortium for Political and Social Research (ICPSR) to obtain some additional variables. The final sample consists of 20,186 observations; each observation represents data for an experimental subject in a given year.

We consider two measures of utilization: the number of contacts with a physician (MDU) and the total number of outpatient contacts with a physician or other health professional (OPU). Summary statistics for these variables are reported in Table 1. The insurance plan variables, defined following Manning et al. (1981), are coinsurance rate, a dummy variable

Table 1. Variable definitions and summary statistics

Variable   Definition                                                    Mean     S.D.
MDU        Number of outpatient visits to an MD                          2.861    4.505
OPU        Number of outpatient visits to all providers                  3.546    6.306
LC         ln(coinsurance + 1), 0 ≤ coinsurance ≤ 100                    1.710    1.9625
IDP        If individual deductible plan: 1, otherwise: 0                0.220    0.414
LPI        ln(max(1, annual participation incentive payment))            4.709    2.697
FMDE       If IDP = 1: 0; otherwise ln(max(1, MDE/(0.01 coinsurance)))   3.153    3.641
LINC       ln(family income)                                             8.708    1.228
LFAM       ln(family size)                                               1.248    0.539
AGE        Age in years                                                 25.718   16.768
FEMALE     If person is female: 1                                        0.517    0.500
CHILD      If age is less than 18: 1                                     0.402    0.490
FEMCHILD   FEMALE * CHILD                                                0.194    0.395
BLACK      If race of household head is black: 1                         0.182    0.383
EDUCDEC    Education of the household head in years                     11.967    2.806
PHYSLIM    If the person has a physical limitation: 1                    0.124    0.322
DISEASE    Index of chronic diseases                                    11.244    6.742
HLTHG      If self-rated health is good: 1                               0.362    0.481
HLTHF      If self-rated health is fair: 1                               0.077    0.267
HLTHP      If self-rated health is poor: 1                               0.015    0.121

Omitted category is excellent self-rated health.

Notes: MDE denotes maximum dollar expenditure, the medical expenditure liability limit set in the experiment above which the participant would not be responsible for cost-sharing. Let ND denote the number of disease conditions, and min(ND) and max(ND) be the minimum and maximum values of ND in the sample. Then DISEASE = 100 * [ND − min(ND)] / [max(ND) − min(ND)].
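The DISEASE rescaling in the table notes is a simple min–max normalization to a 0–100 scale; for instance (our illustrative code, not part of the RHIE processing):

```python
import numpy as np

def disease_index(nd):
    """DISEASE = 100 * (ND - min ND) / (max ND - min ND), per Table 1 notes."""
    nd = np.asarray(nd, dtype=float)
    return 100.0 * (nd - nd.min()) / (nd.max() - nd.min())
```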