In this chapter we discuss the generalized estimating equations (GEE) approach for analyzing longitudinal data. Over the past 20 years, the GEE approach has proven to be an exceedingly useful method for the analysis of longitudinal data, especially when the response variable is discrete (e.g., binary, ordinal, or a count). When the longitudinal response is discrete, linear models (e.g., linear mixedeffects models) are not very appealing for relating changes in the mean response to covariates for at least two main reasons. First, with a discrete response there is intrinsic dependence of the variability on the mean. Second, the range of the mean response (e.g., a proportion or rate for a response that is binary or a count, respectively) is constrained. In the setting of regression modeling of a univariate response, both of these aspects of the response can be conveniently accommodated within generalized linear models via known variance and link functions.
In this chapter we discuss the generalized estimating equations (GEE) approach for analyzing longitudinal data. Over the past 20 years, the GEE approach has proven to be an exceedingly useful method for the analysis of longitudinal data, especially when the response variable is discrete (e.g., binary, ordinal, or a count). When the longitudinal response is discrete, linear models (e.g., linear mixedeffects models) are not very appealing for relating changes in the mean response to covariates for at least two main reasons. First, with a discrete response there is intrinsic dependence of the variability on the mean. Second, the range of the mean response (e.g., a proportion or rate for a response that is binary or a count, respectively) is constrained. In the setting of regression modeling of a univariate response, both of these aspects of the response can be conveniently accommodated within generalized linear models via known variance and link functions.
However, a straightforward application of generalized linear models to longitudinal data is not appropriate, due to the lack of independence among repeated measures obtained on the same individual. There has been extensive statistical literature on extending generalized linear models to the longitudinal data setting. One approach for accounting for the withinsubject association is via the introduction of random effects in generalized linear models. This leads to a class of models known as generalized linear mixed models (GLMMs); see Chapter 4 for a very comprehensive and expository discussion of these models. Generalized linear mixed models represent one way to extend generalized linear models to longitudinal data; they extend in a natural way the conceptual approach represented by the linear mixedeffects models for continuous responses (e.g., Laird and Ware, 1982; see also Chapter 1). There is a second approach for extending generalized linear models to longitudinal data that leads to a class of regression models that are known as marginal models. The term marginal in this context is used to emphasize that the model for the mean response at each occasion depends only on the covariates of interest, and does not incorporate dependence on random effects or previous responses. This is in contrast to GLMMs, where the mean response is modeled not only as a function of covariates but is conditional also on random effects. The most salient feature of marginal models is a regression model, with appropriately specified link function, relating the mean response at each occasion to the covariates. For estimation of the regression model parameters, marginal models do not necessarily require distributional assumptions for the vector of longitudinal responses. When full distributional assumptions for the vector of responses are avoided, the marginal model is said to be semiparametric, and this leads to a method of estimation known as generalized estimating equations (Liang and Zeger, 1986), the main focus of this chapter.
Thus, the GEE approach has its basis in one particular type of extension of generalized linear models to longitudinal, or more generally, clustercorrelated data. Before outlining the main features and properties of GEE methods in Section 3.2, it is useful to review some of its predecessors and consider the reasons why this method has been so widely adopted for longitudinal analyses.
Prior to the seminal companion papers on GEE by Liang and Zeger (1986) and Zeger and Liang (1986), statistical methods for the analysis of discrete longitudinal data had lagged somewhat behind the corresponding developments for continuous outcomes. The early foundations for statistical methods for estimation of marginal models for repeated categorical responses can be traced to a general approach developed by Grizzle, Starmer, and Koch (1969); this approach soon became known as the GSK method. Although the GSK method was developed originally as a very general method for the analysis of categorical data, Koch and Reinfurt (1971) and Koch et al. (1977) recognized that the method could be applied to the analysis of repeated measurements. The GSK method is founded on a weighted leastsquares (WLS) approach that makes few assumptions about the withinsubject association among the repeated categorical outcomes. Specifically, this WLS approach is based on a multinomial sampling model for the joint distribution of the vector of categorical outcomes and relies on the asymptotic normality of estimators of the corresponding multinomial probabilities. Recall that if a categorical outcome, with C levels, is measured repeatedly at n occasions, there are C ^{ n } possible response profiles; thus, the joint distribution of the vector of longitudinal responses is multinomial with C ^{ n } response probabilities (C^{n} −1 nonredundant probabilities because the multinomial probabilities are constrained to sum to 1). In general, for a marginal model for the repeated categorical responses, the main interest is focused on the n × (C − 1) correlated marginal probabilities or transformations of these marginal probabilities (e.g., logit transformations). Moreover, these marginal probabilities can simply be expressed as linear transformations of the underlying multinomial probabilities. In the GSK method, the sample proportions (or means) at each occasion are obtained within each covariate stratum and grouped together to form a sample mean vector of length n × (C − 1); the sample covariance matrix is also estimated. A marginal model is specified by relating the sample means (or known functions of the sample means such as the empirical logits) at each occasion to a linear function of stratum covariates. Appealing to the multivariate central limit theorem, and the use of the socalled delta method, the observed sample means (or functions of the sample means) have approximate normal distributions, given sufficient sample sizes within each stratum. The WLS method, with the sample mean vectors as the outcome vectors and the inverse of the sample covariance matrices as the weight matrices, is then used to estimate the marginal model regression parameters.
At the time of its introduction, the GSK method provided a method for estimating a very general family of models for repeated categorical data, allowing nonlinear link functions to relate the marginal expectations of the longitudinal responses to covariates. For example, the method allows the fitting of logistic regression models to repeated binary outcomes. At the heart of the GSK method are the notions of stratification of subjects on the basis of covariates and nonparametric estimation of the multinomial probabilities for the vector of repeated categorical responses within each stratum. Because the multinomial probabilities are estimated nonparametrically as the sample proportions within strata, asymptotically, the GSK method is equivalent to maximum likelihood (ML) estimation.
However, the GSK method has a number of restrictions that limit its usefulness for the analysis of longitudinal data. Because it relies on stratification, the method requires that all covariates be categorical or the reduction of quantitative covariates to categorical variables. Moreover, it also requires a relatively large number of subjects within each stratum to estimate the C ^{ n } multinomial probabilities; note that with C = 4 response levels for an outcome measured repeatedly at n = 6 occasions, there are 4^{6−1} or 4095 nonredundant multinomial probabilities. Implicitly, this means that the method can only be validly applied when (i) the number of repeated measures, n, is relatively small, thereby ensuring that the number of multinomial probabilities to be estimated does not grow exponentially; (ii) the number of covariate strata is relatively small; and (iii) the number of individuals is relatively large, ensuring a sufficient number of subjects within each stratum. Finally, the GSK method, as originally developed, also requires that the longitudinal study design be balanced on time in the sense that all subjects are measured at the same set of occasions; later refinements to the GSK method can handle balanced longitudinal designs with incompleteness due to missing observations.
As an alternative to GSK methods, full likelihoodbased methods for estimating marginal models for repeated categorical data can, in principle, accommodate many of these common features of longitudinal studies (e.g., covariates that are both quantitative and categorical, missing data, and so on). However, in practice, ML fitting of marginal models for discrete longitudinal data has proven to be very challenging. Among the many challenges are: (i) it can be conceptually difficult to model higherorder associations in a flexible and interpretable manner that is consistent with the model for the marginal expectations; (ii) given a marginal model for the vector of repeated outcomes, the multinomial probabilities cannot, in general, be expressed in closed form as a function of the model parameters; and (iii) the number of multinomial probabilities grows exponentially with the number of repeated measures. As a result, ML estimation is feasible only for a relatively small number of repeated measures. Thus, although the development of likelihoodbased methods for marginal models has been fertile ground for methodological research (e.g., Bahadur, 1961; McCullagh and Nelder, 1989; Zhao and Prentice, 1990; Lipsitz, Laird, and Harrington, 1990; Liang, Zeger, and Qaqish, 1992; Becker and Balagtas, 1993; Fitzmaurice, Laird, and Rotnitzky, 1993; Molenberghs and Lesaffre, 1994; Lang and Agresti, 1994; Glonek and McCullagh, 1995; Bergsma and Rudas, 2002; and many others), to date, ML methods that have been developed have also proven to be of only limited practical use.
Thus, by the end of the 1970s, the GSK method provided the first unified approach for extending generalized linear models to repeated categorical data, based on an application of noniterative weighted least squares. The method was implemented in widely available statistical software (e.g., proc catmod in SAS). However, the method lacked versatility and could not accommodate many of the common features of longitudinal studies, namely mixtures of quantitative and categorical, timeinvariant and timevarying, covariates and inherently unbalanced designs with both mistimed measurements and missing data. So, by the early 1980s, the time was ripe for new methods for the analysis of discrete longitudinal data that could handle all of these aspects of longitudinal data in a relatively flexible way.
Finally, we note that the WLS estimation at the heart of the GSK method is similar in spirit to noniterative empirical logistic regression for binomial data (see, for example, Cox and Snell, 1989). Empirical logistic regression is also applicable only when individuals can be grouped into strata based on their covariates; at the time of its introduction, this greatly limited the usefulness of empirical logistic regression in practice. However, as computing advanced, empirical logistic (WLS) regression for binomial data was soon replaced by ML methods that were far more versatile. In a similar vein, the GSK method for discrete longitudinal data was replaced not by ML methods, for all the reasons mentioned earlier, but by the GEE approach developed by Liang and Zeger (1986). By and large, the GEE approach has overcome most of the limitations of its predecessors and has fundamentally changed the way empirical researchers can approach the analysis of discrete longitudinal data.
In this section we briefly review the main features of marginal models for longitudinal data and discuss the key properties of the GEE approach. Because the GEE approach can be considered a multivariate extension of quasilikelihood estimation, we first discuss estimation of the regression parameters in generalized linear models for a univariate outcome, before considering estimation of the regression parameters in a marginal model for longitudinal outcomes.
Before we begin our discussion of marginal models and the GEE approach, we introduce some notation. We assume that N subjects are measured repeatedly over time. We let Y_{ij} denote the response variable for the ith subject on the jth measurement occasion (i = 1, ..., N; j = 1, ..., n_{i} ). The response variable can be continuous or discrete (e.g., binary, ordinal, polytomous, or a count). The type of response variable (e.g., binary or count) will have implications for model specification. In general, we do not assume that subjects have the same number of repeated measures or that they are measured at a common set of occasions. To accommodate such unbalanced longitudinal data (i.e., repeated measurements that are not obtained at a common set of occasions), we assume that there are n_{i} repeated measurements of the response on the ith subject and that each Y_{ij} is observed at time t_{ij} . We can group the responses into an n_{i} × 1 vector of responses denoted by Y _{ i }. Finally, associated with each response, Y_{ij} , there is a p × 1 vector of covariates, X _{ij} . We note that X _{ij} may include covariates whose values do not change throughout the duration of the study and covariates whose values change over time. The former are referred to as timestationary or betweensubject covariates (e.g., gender and fixed experimental treatments or interventions), whereas the latter are referred to as timevarying or withinsubject covariates (e.g., time since baseline, current smoking status, and environmental exposures that can vary over time). We can group the vectors of covariates into an n_{i} × p matrix of covariates denoted by X_{i} .
As mentioned in Section 3.1, the defining feature of marginal models is a regression model relating the mean response at each occasion, via a suitable link function, to the covariates. With a marginal model, the main focus is on making inferences about population means. As a result, marginal models for longitudinal data separately model the mean response and the withinsubject association among the repeated responses. In a marginal model, the goal is to make inferences about the former, whereas the latter is regarded as a nuisance characteristic of the data that must be taken into account in order to make correct inferences about changes in the population mean response over time.
A marginal model for longitudinal data has the following threepart specification:
This threepart specification of a marginal model makes the extension of generalized linear models to longitudinal data more transparent. The first two parts of the marginal model correspond to the standard generalized linear model, albeit with no explicit distributional assumptions about the responses. It is the third component, the incorporation of the withinsubject association among the repeated responses from the same individual, that represents the main extension of generalized linear models to longitudinal data.
It is also worth emphasizing that this threepart specification of a marginal model can, in principle, be extended by making full distributional assumptions about the vector of responses. To do so would require that all two and higherway associations be specified in the third component of the model. However, for reasons mentioned in Section 3.1, ML fitting of marginal models for discrete longitudinal data has proven to be very challenging. Consequently, the third component of a marginal model is typically specified in terms of the twoway or pairwise association among the repeated responses from the same individual.
In summary, marginal models are a very natural way to extend generalized linear models to longitudinal responses. Marginal models specify a generalized linear model for the longitudinal responses at each occasion but also include a model for the withinsubject association among the responses. A crucial aspect of marginal models is that the mean response and withinsubject association are modeled separately. This separation of the modeling of the mean response and the association among responses has important implications for interpretation of the regression parameters in the model for the mean response. In particular, the regression parameters, β , in the marginal model have socalled populationaveraged interpretations. That is, they describe how the mean response in the population is related to the covariates. For example, regression parameters in a marginal model might have interpretation in terms of contrasts of the changes in the mean responses in subpopulations (e.g., different treatment, intervention, or exposure groups); see Chapter 7 for a detailed discussion of various aspects of interpretation of marginal model parameters.
As noted in Section 3.2.2, the threepart marginal model specification does not require full distributional assumptions for the repeated responses, only a regression model for the mean response. The avoidance of distributional assumptions can be advantageous because, in general, there is no convenient specification of the joint multivariate distribution of Y _{ i } for marginal models when the responses are discrete. When full distributional assumptions are avoided, the model is referred to as being semiparametric because there is a parametric component β and a nonparametric component determined by the nuisance parameters for the second and higherorder moments. The avoidance of distributional assumptions for Y _{ i } leads to a method of estimation known as generalized estimating equations. Thus, the GEE approach can be thought of as providing a convenient alternative to ML estimation. The GEE approach proposed by Liang and Zeger (1986) is a multivariate generalization of the quasilikelihood approach for generalized linear models introduced by Wedderburn (1974). To better understand this connection to quasilikelihood estimation, in the following we briefly outline the quasilikelihood approach for generalized linear models for a univariate response before discussing its extension to multivariate responses.
In the following, we now assume N independent observations of a scalar response variable, Y_{i} . Associated with the response, Y_{i} , there are p covariates, X _{ i1},...,X_{ip} . We assume that primary interest is in relating the mean of Y_{i} , μ_{i} = E(Y_{i} X _{ i1},...,X_{ip} ), to the covariates. In generalized linear models, the distribution of the response is assumed to belong to the exponential family of distributions (e.g., normal, Bernoulli, binomial, and Poisson). As described in the first part of the specification of the marginal model in Section 3.2.2, in generalized linear models a transformation of the mean response, μ_{i} , is linearly related to the covariates via an appropriate link function,
where the link function h ^{−1}(·) is a known function, such as log(μ_{i} ). The assumption that Y_{i} has an exponential family distribution has implications for the variance of Y_{i} . In particular, a feature of exponential family distributions is that the variance of Y_{i} can be expressed in terms of a known function of the mean and a scale parameter,
where the scale parameter ϕ > 0. The variance function, v(μ_{i} ), describes how the variance of the response is functionally related to the mean of Y_{i} ; the variance function was previously discussed in the second part of the specification of the marginal model in Section 3.2.2.
Next, we consider estimation of β . Assuming Y_{i} follows an exponential family density, with Var(Y_{i} ) = ϕv(μ_{i} ), the ML estimator of β is obtained as the solution to the likelihood score equations,
where ∂μ_{i} /∂ β is the 1 × p vector of derivatives, ∂μ _{ i }/∂β _{ k },k = 1,...,p. Interestingly, the likelihood equations for generalized linear models depend only on the mean and variance of the response (and the link function). Consequently, Wedderburn (1974) suggested using them as “estimating equations” for any choice of link or variance function, even when the particular choice of variance function does not correspond to an exponential family distribution. That is, Wedderburn (1974) proposed estimating β by solving the quasilikelihood equations,
Wedderburn (1974) showed that for any choice of weights, V_{i} , the quasilikelihood estimator of β , say $\hat{\mathit{\beta}}$
, is consistent and asymptotically normal. The choice of weights V_{i} = Var(Y_{i} ) yields the estimator with smallest variance among all estimators in this class. Note that in generalized linear models, it is assumed that V_{i} = Var(Y_{i} )= φv(μ _{ i }), and this assumption is sufficient to characterize the distribution within the exponential family. Thus, the optimal estimating equations (or quasilikelihood equations) coincide with the score equations for the case where Y_{i} is assumed to have an exponential family distribution.In summary, Wedderburn (1974) proposed estimators of β that do not require distributional assumptions on the response. This allows more flexible models for variability, e.g., incorporating overdispersion. Moreover, quasilikelihood estimation only requires correct specification of the model for the mean to yield consistent and asymptotically normal estimators of β . That is, a key property of quasilikelihood estimators is that they are consistent even when the variance of the response has been misspecified, that is, V_{i} ≠ Var(Y_{i} ). Specifically, it can be shown that the asymptotic distribution of $\hat{\mathit{\beta}}$
, the estimator for β obtained from (3.1) with a particular choice of V_{i} , satisfieswhere
and
Consistent estimators of the asymptotic covariance of the estimated regression parameters can be obtained using the empirical estimator of C _{ β } first suggested by Cox (1961), and later proposed by Huber (1967), White (1982), and Royall (1986). The empirical variance estimator is obtained by evaluating ∂μ_{i} /∂β at $\hat{\mathit{\beta}}$
and substituting ${({Y}_{i}{\hat{\mu}}_{i})}^{2}$ for Var(Y_{i} ); this is widely known as the sandwich variance estimator. Moreover, it can be shown that the same asymptotic distribution holds when V_{i} is estimated rather than known, with V_{i} replaced by estimated weights, say ${\hat{V}}_{i}$ .Next, we consider the multivariate extension of this quasilikelihood approach to the setting of marginal models for longitudinal responses. In the following, Y _{ i } denotes an n_{i} × 1 vector of responses (similarly, μ _{ i } denotes an n _{ i } × 1 vector of means) and X_{i} is an n_{i} × p matrix of covariates. The fundamental idea underlying the GEE approach is to extend the quasilikelihood equations (or estimating equations) to the multivariate setting by replacing Y_{i} and μ_{i} by their corresponding multivariate counterparts ( Y _{ i } and μ _{i} ) and using a matrix of weights V_{i} . Thus, the estimating equations for β in a marginal model are given by
where D_{i} = ∂ μ _{i} /∂β is now an n_{i} × p matrix of derivatives whose kth row is ∂ μ _{i} /∂β_{k} . As with the quasilikelihood approach, the optimal choice of V_{i} is to take V_{i} = Cov( Y _{ i }), the n_{i} ×n_{i} covariance matrix for Y _{ i }. Note that the GEE given by (3.2) is simply the multivariate analog of the quasilikelihood equations given by (3.1). However, unlike the quasilikelihood equations given by (3.1), the weight matrix, V_{i} , depends not only on β but also on the pairwise associations among the longitudinal responses.
As mentioned in the previous section, GEE can be regarded as a multivariate extension of the quasilikelihood estimating equations. Note that the scalar weight in (3.1) is replaced by the weight matrix, V_{i} , in (3.2). In general, the assumed covariance among the responses can be specified as
where A_{i} = diag{v(μ_{ij} )} is a diagonal matrix with diagonal elements v(μ_{ij} ), which are specified entirely by the marginal means (i.e., by β ), R_{i} ( α ) is an n_{i} × n_{i} correlation matrix, and φ is a dispersion parameter. For the correlation matrix R_{i} ( α ), α represents a vector of parameters associated with a specified model for Corr( Y _{ i }), with typical element
In the GEE approach, V_{i} is usually referred to as a “working covariance,” where the term “working” is used to emphasize that V_{i} is only an approximation to the true covariance, say Σ _{i} = Cov( Y _{ i }). Sometimes, R_{i} ( α ) is referred to as a “working correlation” matrix; however, this implicitly assumes that the marginal variance assumption, Var(Y_{ij} )= φv(μ_{ij} ), is correct. Since both R_{i} ( α ) and Var(Y_{ij} ) can be incorrectly specified, we prefer the use of the term “working covariance.” Note that if R_{i} ( α ) = I, the n_{i} × n_{i} identity matrix, then the GEE reduces to the quasilikelihood estimating equations for a generalized linear model that assume the repeated measures are independent.
Liang and Zeger (1986), and also Prentice (1988), parameterized the withinsubject correlations directly as a linear function of α (typically ρ_{ist} ( α ) = α_{st} , although ρ_{ist} ( α ) could, in principle, be allowed to depend on covariates). When α is known, the only unknown quantity in (3.2) is β and the solution to (3.2) is a consistent estimator of β . In the usual case where V_{i} , and specifically α , is unknown, then we must parameterize and estimate ρ_{ist} ( α ) = Corr(Y_{is} ,Y_{it} ). Some common examples of models for the correlation include: “exchangeable,” in which ρ_{ist} = α for all s < t; firstorder autoregressive (AR(1)),
where 0 < α < 1, and the correlation decreases as the time between measurements (t − s) increases; and “unstructured,” in which ρ_{ist} = α_{st} . For estimation of α , let
which has expected value ρ_{ist} ; the U_{ist} can be grouped together to form the n_{i} (n_{i} − 1)/2 × 1 vector U _{ i }( β ) = (U _{ i12},U _{ i13},...,U _{ ini −1ni })′. Also, let ρ _{i} ( α ) = E( U _{ i }; α ) = (ρ _{ i12},ρ _{ i13},..., ρ _{ in i−1 ni })′. Then, a second set of (moment) estimating equations similar to (3.2) can be used to estimate α , given by
where W_{i} ≈ Cov( U _{ i }) and E_{i} = ∂ρ _{i} ( α )/∂α . The working covariance matrix for U _{ i } is typically specified as diag{Var(U_{ist} )}. In their original paper on GEE, Liang and Zeger (1986) let W_{i} be the (n_{i} × n_{i} − 1)/2 × (n_{i} × n_{i} − 1)/2 identity matrix, whereas Prentice (1988) suggested letting W_{i} be a diagonal matrix with the approximate variances of U_{ist} along the diagonal.
For the special case where the outcome is binary, an alternative to the correlation as a measure of association between pairs of binary responses is the odds ratio. The odds ratio has many desirable properties (e.g., see Bishop, Fienberg, and Holland, 1975, Chapter 11) and has a more straightforward interpretation. To estimate the odds ratio as a measure of association, a second set of estimating equations similar to (3.3) (Lipsitz, Laird, and Harrington, 1991) can be used in combination with (3.2) for the marginal regression parameters. A more efficient second set of estimating equations to estimate the odds ratio parameters, referred to as “alternating logistic regression,” was later proposed by Carey, Zeger, and Diggle (1993), and is discussed in greater detail in the next section. We also note that Qu et al. (1992) proposed using tetrachoric correlations as measures of association between pairs of binary responses; the tetrachoric correlation is the correlation between underlying latent normal variables that are assumed to have been dichotomized to form the observed binary outcomes. Additional work extending the GEE approach to longitudinal ordinal and polytomous data has been developed by, for example, Miller, Davis, and Landis (1993), Lipsitz, Kim, and Zhao (1994), Kenward, Lesaffre, and Molenberghs (1994), Gange et al. (1995), Lumley (1996), and Heagerty and Zeger (1996). The main challenge with extending the GEE approach to ordinal and polytomous responses has been that the working covariance for ordinal and polytomous responses (other than “working independence”), in general, requires the specification and estimation of a large number of nuisance parameters.
Given any parameterization of the working covariance, V_{i} , in (3.2), and using Taylor series expansions similar to Prentice (1988), assuming that the regression model for the mean of Y _{ i } has been correctly specified, $\hat{\mathit{\beta}}$
is consistent for β , and also $\sqrt{N}(\hat{\mathit{\beta}}\mathit{\beta})$ has an asymptotic distribution which is multivariate normal with mean vector 0 and covariance matrix given byand if Cov( Y _{ i }) is correctly specified, so that V_{i} = Cov( Y _{ i }), then (3.4) reduces to
Note that a key property of the GEE estimators of β is that they are consistent and asymptotically normal, for any choice of working covariance V_{i} , provided the regression model for the mean response has been correctly specified. Heuristically, using results from the method of moments, $\hat{\mathit{\beta}}$
is consistent for β regardless of whether V_{i} is the true covariance matrix of Y _{ i } becauseand we are solving ${\mu}_{\beta}\left(\hat{\mathit{\beta}}\right)=\mathbf{0}$
for $\hat{\mathit{\beta}}$ . In particular, for any positive definite and symmetric matrix V_{i} , the solution to (3.2) is a consistent estimator of β . This is the key property of the GEE method and implies that V_{i} does not have to be correctly specified in order to obtain consistent estimators of β .Finally, C_{β} in (3.4) can be consistently estimated by ${\hat{C}}_{\beta}$
, which is obtained by replacing β and α by their estimates, and also Cov( Y _{ i }) by $({\mathit{Y}}_{i}{\hat{\mathit{\mu}}}_{i}){({\mathit{Y}}_{i}{\hat{\mathit{\mu}}}_{i})}^{\prime}$ . The estimator ${\hat{C}}_{\beta}$ is a sandwich variance estimator and has the same form as the sandwich variance estimator in our earlier discussion of quasilikelihood estimation. Some authors refer to this sandwich variance estimator as the “robust” variance estimator because it is consistent for the asymptotic variance of $\hat{\mathit{\beta}}$ provided $E\left\{{u}_{\beta}\left(\hat{\mathit{\beta}}\right)\right\}=\mathbf{0}$ ; that is, the sandwich variance estimator can be said to be robust to misspecification of the covariance among the repeated measures.Obtaining GEE estimates requires an iterative algorithm. The structure of the GEE suggests the use of a specific iterative scheme, namely to iterate between estimating β (given the current estimate of α ) as the solution to (3.2), and estimating α (given the current estimate of β ) as the solution to (3.3) until convergence. In particular, the solution $(\hat{\mathit{\beta}},\hat{\mathit{\alpha}})$
to (3.2) and (3.3) can be obtained by a Fisher scoring algorithm. Given a starting value for β , say under the naive assumption of independence, the solution $(\hat{\mathit{\beta}},\hat{\mathit{\alpha}})$ can be obtained by iterating betweenand
until ${\hat{\mathit{\beta}}}^{(m+1)}={\hat{\mathit{\beta}}}^{\left(m\right)}$
and ${\hat{\mathit{\alpha}}}^{(m+1)}={\hat{\mathit{\alpha}}}^{\left(m\right)}$ , where ${D}_{i}^{\left(m\right)}={D}_{i}\left\{{\hat{\mathit{\beta}}}^{\left(m\right)}\right\},{V}_{i}^{\left(m\right)}={V}_{i}\{{\hat{\mathit{\beta}}}^{\left(m\right)},{\hat{\mathit{\alpha}}}^{\left(m\right)}\},{E}_{i}^{\left(m\right)}={E}_{i}\{{\hat{\mathit{\beta}}}^{\left(m\right)},{\hat{\mathit{\alpha}}}^{\left(m\right)}\}$ , and ${W}_{i}^{\left(m\right)}={W}_{i}\{{\hat{\mathit{\beta}}}^{\left(m\right)},{\hat{\mathit{\alpha}}}^{\left(m\right)}\}$ .We note that, given the current estimate of β , the estimator of α proposed by Liang and Zeger (1986) is noniterative. For example, suppose an “exchangeable” correlation pattern is assumed, in which ρ_{ist} = α for all s < t. Then, Liang and Zeger (1986) proposed estimating α, given the current estimate of β , say $\hat{\mathit{\beta}}$
, byalternatively, various degreeoffreedom corrections to account for the estimation of β have been suggested for the denominator N ^{∗}, e.g., N ^{∗} − p.
In this section we consider properties of the GEE estimators. Before summarizing the key properties of the GEE approach, we note that there is an implicit assumption in the first component of a marginal model that is often overlooked. Recall that in a marginal model the conditional expectation of each response is assumed to depend on the covariates through a known link function
In marginal models, the primary interest lies in estimation of the regression parameters, β , in the model for E(Y_{ij}  X _{ij} ). The key property for (asymptotically) unbiased estimators using GEE is given in (3.6), and can be rewritten as
where E_{xi } (·) denotes expectation with respect to the marginal distribution of X_{i} and E _{ yi xi }(·) denotes expectation with respect to the conditional distribution of Y _{ i } given X_{i} . Note that the jth element of the vector μ _{i} ( β ) is μ_{ij} = E(Y_{ij}  X _{ij} ), but the elements of E _{ yi xi }( Y _{ i }) in (3.8) are E(Y_{ij}  X _{ i1},..., X _{ ini }). Thus, for (3.8) to hold, the conditional mean of the jth response, given X _{ i1},..., X _{ini } , must depend only on X _{ij} , that is,
see Fitzmaurice, Laird, and Rotnitzky (1993) and Pepe and Anderson (1994) for a more detailed discussion of this sufficient condition. With timestationary covariates, this assumption poses no difficulties; it necessarily holds because X _{ij} = X _{ik} for all occasions k ≠ j. Also, with timevarying covariates that are fixed by design of the study (e.g., time since baseline, treatment group indicator in a crossover trial), the assumption also holds because values of the covariates at any occasion are determined a priori by study design and in a manner completely unrelated to the longitudinal response. However, when a timevarying covariate varies randomly over time the assumption made in (3.9) may not hold. For example, the assumption will be violated when the current value of the response, say Y_{ij} , given the current covariates X _{ij} , predicts the subsequent value of X _{ i,j+1}; see Chapter 23 for a more detailed discussion of stochastic timevarying covariates in longitudinal studies. When (3.9) is not satisfied, yet there is subjectmatter interest in the dependence of Y_{ij} on X _{ij} , Pepe and Anderson (1994) recommend using GEE with a “working independence” assumption. Under a “working independence” assumption, the weight matrix is diagonal and the corresponding estimating equations simplify and are unbiased regardless of whether or not (3.9) is satisfied. In contrast, under alternative choices for the working covariance matrix, the estimating equations are not necessarily unbiased and may yield inconsistent estimators of the regression parameters.
Thus far, we have seen that the GEE approach provides a convenient alternative to ML estimation of the regression parameters in marginal models for longitudinal data, while also retaining a number of appealing properties. First, in many longitudinal designs the GEE estimator $\hat{\mathit{\beta}}$
is almost as efficient as the ML estimator. For example, consider the case of linear models for continuous responses that are assumed to have a multivariate normal distribution. It can easily be shown that the generalized leastsquares estimator of β in linear models can be considered a special case of the GEE approach. The GEE also has an expression similar to the likelihood equations for β in certain “mixed parameter” models for discrete longitudinal data (e.g., Fitzmaurice and Laird, 1993; Fitzmaurice, Laird, and Rotnitzky, 1993). As a result, for many longitudinal designs, there is little loss of efficiency when the GEE approach is adopted as an alternative to maximum likelihood. Moreover, because all GEE estimators of β are consistent and asymptotically normal, it is of interest to consider their efficiency under various working covariance assumptions. Indeed, it has been suggested in the literature that setting R_{i} = I, the identity matrix, leads to an estimator with nearly the same efficiency as the optimally weighted GEE (with V_{i} = Σ _{i} ). Interestingly, for a balanced longitudinal design, with no missing data, there are cases where this claim has some justification. Specifically, there is relatively little loss of efficiency either for betweensubject effects (where the covariate design X_{ijk} = X _{ ij′k } for all occasions j ≠ j′) or for withinsubject effects when the covariate design on time for the latter effects is the same for all subjects (i.e., X_{ijk} ≠ X _{ ij′k } for some occasions j ≠ j′, but X_{ijk} = X _{ i′jk } for all subjects i ≠ i′). However, for the case of a withinsubject effect when the covariate design on time is not the same for all subjects (i.e., X_{ijk} ≠ X _{ ij′k } for some occasions j ≠ j′ and X_{ijk} ≠ X _{ i′jk } for some subjects i ≠ i′), there can be a very discernible loss of efficiency under the nonoptimal “working independence” assumption for the covariance, especially when the true correlations are moderately large (see, for example, Lipsitz et al., 1994). Heuristically, this result can be explained as follows. When using GEE methods, the estimated β s in a marginal model can be thought of as weighted averages of between and withinsubject contrasts. In the case of a betweensubject effect (e.g., exposure or treatment group) or a withinsubject effect when the covariate design on time is the same for all subjects (e.g., time since baseline), the respective between or withinsubject contrasts are weighted approximately equally, regardless of the assumed (or working) covariance. On the other hand, for a withinsubject effect when the covariate design on time is not the same for all subjects, the GEE estimate is a weighted average of both betweensubject and withinsubject contrasts that are weighted differently. Moreover, the weights for these different contrasts depend on the assumed covariance. So, for the latter case, the correlation is more important and determines the optimal weights for combining the contrasts. In addition, in the setting of unbalanced data, with n_{i} = n, each individual no longer contributes equally weighted components to even the betweensubject effects, and thus we can expect a loss of efficiency for the GEE under “working independence” in that setting as well. So, in general, it can be advantageous to choose a working covariance that closely approximates the true covariance. The closer the working covariance matrix (V_{i} ) approximates the true underlying covariance matrix (Σ _{i} ), the greater the efficiency for estimation of β .A second appealing property of the GEE estimator $\hat{\mathit{\beta}}$
is its robustness, yielding a consistent estimator of β even if the withinsubject associations among the repeated measures have been misspecified. It only requires that the model for the mean response be correct. This robustness property of GEE is important because the usual focus of a longitudinal study is on changes in the mean response. In general, there is usually far less interest in, and correspondingly less subjectmatter knowledge of, the patterns of covariance among the repeated measures. Although the GEE approach yields a consistent estimator of β under misspecification of the withinsubject associations, the usual standard errors obtained under the misspecified model for the withinsubject association are not valid. Fortunately, in many cases, valid standard errors for $\hat{\mathit{\beta}}$ can be obtained using the socalled sandwich variance estimator. A remarkable property of the sandwich estimator is that it is also robust in the sense that it provides valid standard errors when the assumed model for the covariances among the repeated measures is not correct. That is, with large sample sizes, the sandwich variance estimator yields correct standard errors. However, it is worth emphasizing that this robustness property of the sandwich variance estimator is an asymptotic property. In general, use of the sandwich variance estimator is best suited to balanced longitudinal designs where the number of subjects (N) is relatively large and the number of repeated measures (n) is relatively small. Moreover, the sandwich estimator is less appealing when the design is severely unbalanced and/or when there are few replications to estimate the true underlying covariance matrix. The use of the sandwich estimator implicitly relies on there being many replications of the vector of responses associated with each distinct set of covariate values.Note that biascorrected versions of the sandwich variance estimator have been proposed that have somewhat better finitesample properties, including the jackknife (Paik, 1988; Mancl and DeRouen, 2001). Most of these biascorrected versions involve a “degreesoffreedom” correction. In cases where there are few, if any, replications of Y _{ i } associated with each distinct set of covariate values, the use of the sandwich estimator can be problematic. In particular, standard errors based on the sandwich estimator tend to be biased downward. In addition, the sampling variability of the sandwich estimator of $\text{Cov}\left(\hat{\mathit{\beta}}\right)$
can be very large, resulting in an unstable estimator of variability. In these settings, it may be preferable to carefully model the covariances among the responses and use the “modelbased” estimator of $\text{Cov}\left(\hat{\mathit{\beta}}\right)$ given by (3.5) and evaluated at $(\hat{\mathit{\beta}},\hat{\mathit{\alpha}})$ . This estimator of $\text{Cov}\left(\hat{\mathit{\beta}}\right)$ is referred to as a “modelbased” estimator to remind us that it yields valid standard errors provided that the working covariance matrix, V_{i} , is a close approximation to the true underlying covariance matrix, Σ _{i} . In general, unlike the sandwich variance estimator, the “modelbased” estimator does not require such a large number of replications; the sampling variability of the “modelbased” estimator tends to be smaller than that of the sandwich variance estimator. The key to obtaining approximately unbiased variance estimators using the “modelbased” estimator is to choose a model for the working covariance matrix that is close to the true covariance matrix.Finally, for making inferences about β , since the GEE approach is not likelihoodbased, likelihood ratio tests are not available for hypotheses testing; in addition, this also has ramifications for inferences when there are missing data, a topic that will be discussed in Section 3.4. Instead, inferences typically rely on Wald test statistics based on quadratic forms. For example, if it is of interest to test whether or not the ℓth element of β is 0, H_{0} : β _{ ℓ } = 0, then the Wald test statistic,
can be constructed, where $\hat{\text{Var}}\left({\hat{\beta}}_{\ell}\right)$
is the ℓth diagonal element of either the modelbased or, more typically, the sandwich variance estimate. More generally, for a null hypothesis of the form H _{0} : Lβ = 0, versus the alternative H_{A} : Lβ ≠ 0, for an r × p matrix L of full rank, r ≤ p, the Wald test statistic iswhere ${\chi}_{r}^{2}$
denotes a chisquare distribution with r degrees of freedom.Although this use of the Wald statistic allows for the comparison of nested models, it can be potentially problematic because Wald statistics are known to have less than optimal properties under the alternative in logistic regression models for univariate outcome data (Hauck and Donner, 1977). We conjecture that the same may be true for Wald statistics based on GEE estimates of marginal regression parameters for longitudinal binary data. As an alternative, and to circumvent this potential problem with the Wald statistic, one can base inferences on the score test statistic proposed by Rotnitzky and Jewell (1990) and Boos (1992). Recall the property of the GEE score vector u _{ β }( β ) given in (3.6), namely E{u_{β} ( β )} = 0. Furthermore, it can be shown that
Finally, since the score vector is a sum of independent random variables, using the central limit theorem, it can be shown that
Then, the “score” test statistic can be expressed as a quadratic form in the score vector,
where ${u}_{\beta}\left(\tilde{\mathit{\beta}}\right)$
and ${\mathit{V}}_{u}\left(\tilde{\mathit{\beta}}\right)$ are the score vector and its variance under the alternative hypothesis, evaluated at $\tilde{\beta}$ , which is the estimate of β under the null hypothesis H _{0} : Lβ = 0.In this section we briefly review some extensions of the standard GEE methods proposed by Liang and Zeger (1986). In particular, we describe alternative estimators of the withinsubject association, especially when the longitudinal responses are discrete. We also discuss joint estimation of the marginal mean and withinsubject association parameters.
As discussed in the previous section, when V_{i} is correctly specified in the standard GEE approach, the resulting estimator of β is semiparametric efficient in the sense of having the smallest variance among all estimators in the class given by (3.2). However, the estimators of the correlation parameters α proposed by Liang and Zeger (1986) are not efficient, in large part due to the use of a suboptimal weight matrix, W_{i} , in (3.3). This has led researchers to develop alternative estimating equations for α . Using the same set of estimating equations for β given in (3.2), we now consider two alternative estimating equations for α (and thus for V_{i} ( β , α )). We note that any set of estimating equations that use (3.2) to estimate β are often referred to as firstorder GEE or “GEE1”. Here, we consider two alternative GEE1s developed with the goal of providing more stable and/or efficient estimates of α than the standard GEE of Liang and Zeger (1986).
One set of estimating equations for α that has generated some interest in the statistical literature is that based on the notion of “Gaussian” estimation (see, for example, Lipsitz, Laird, and Harrington, 1992; Lee, Laird, and Johnston, 1999; Hall and Severini, 1998; Lipsitz et al., 2000; Fitzmaurice, Lipsitz, and Molenberghs, 2001; Wang and Carey, 2003). The key idea here is to base estimation of α on the multivariate normal estimating equations for the correlations. In particular, let
be the vector of standardized residuals, where A_{i} is a diagonal matrix with diagonal elements v(μ_{ij} ). Note that Cov( ξ _{i} ) is the correlation matrix of Y _{ i }, which we denoted earlier as R_{i} = R_{i} ( α ). Then, a second set of (moment) estimating equations can be obtained as the score equations for α under the assumption that ξ _{i} ∼ N(0,R_{i} ( α )). Specifically, the estimating equations for α are given by
The rth component of u _{α} ( α ) in (3.11) equals
where ${\dot{\text{R}}}_{ir}\left(\mathit{\alpha}\right)={\partial R}_{i}\left(\mathit{\alpha}\right)/{\partial \alpha}_{r}$
and tr{·} denotes the trace of a matrix. An appealing feature of the use of the multivariate normal estimating equations is that it ensures that the estimated correlation matrix, ${R}_{i}\left(\hat{\mathit{\alpha}}\right)$ , is nonnegative definite when there are unbalanced data (i.e., n_{i} ≠ n). In contrast, this is not the case for alternative GEE1 methods. Thus, in general, this leads to more stable estimation of α . As a slight modification to these multivariate normal estimating equations, Wang and Carey (2004) proposed estimating the correlation parameters by differentiating the Cholesky decomposition of the working correlation matrix; a similar approach was also used by Ye and Pan (2006). Furthermore, similar Gaussian or “quadratic” estimating equations have been proposed by Crowder (1995) and Qu, Lindsay, and Li (2000).Motivated by the application of GEE methods to longitudinal binary outcomes, a second alternative set of estimating equations for α were proposed by Carey, Zeger, and Diggle (1993) and Lipsitz and Fitzmaurice (1996). Specifically, they proposed a set of estimating equations based on the conditional residuals {Y_{it} − E(Y_{it} Y_{is} = y_{is} , X_{i} )}, that is, deviations about conditional expectations. In contrast, note that the second set of estimating equations given by (3.3) are based on the unconditional residuals
The conditional expectations can be grouped together to form the n_{i} (n_{i} − 1)/2 × 1 vector of conditional residuals, ( U _{ i } − η _{i} ), where
with U_{ist} = Y_{it} and η_{ist} = E(Y_{it} Y_{is} = y_{is} ,X_{i} ), for s < t. Note that for the special case where the Y_{it} are binary, the conditional mean equals
where π _{ ist } = Pr(Y_{is} = 1, Y_{it} = 1; β , α ). Then, a set of estimating equations for α is given by
where E_{i} = ∂η _{i} /∂α and W_{i} = diag{Var(Y_{it} Y_{is} = y_{is} )}. Note that for binary Y_{it} , Var(Y_{it} Y_{is} = y_{is} ) = η_{ist} (1 − η_{ist} ) since $E\left({Y}_{it}^{2}\right{Y}_{is}={y}_{is})=E\left({Y}_{it}\right{Y}_{is}={y}_{is})={\eta}_{ist}$
.Although the conditional residuals (U_{ist} − η_{ist} ) are neither independent nor uncorrelated, Carey, Zeger, and Diggle (1993) argue that the correlations among the conditional residuals should be substantially smaller than the correlations among the unconditional residuals. Consequently, setting W_{i} = diag{η_{ist} (1 − η_{ist} )} in (3.13) should provide a closer approximation to the optimal weight matrix. Kuk (2004) proposed a symmetrized version of these estimating equations that may be particularly useful for an exchangeable correlation structure.
We note that after some algebra, (3.12) can be shown to equal
which is exactly the conditional expectation if (Y_{is} , Y_{it} ) are assumed to be bivariate normal. Thus, even though these estimating equations were originally proposed for analysis of longitudinal binary data, the form of the conditional expectation in (3.14) suggests that they can be used also for nonbinary longitudinal outcomes. The only additional issue that arises for nonbinary outcomes is the specification of Var(Y_{it} Y_{is} = y_{is} ). For binary outcome data, it is straightforward to specify this variance as Var(Y_{it} Y_{is} = y_{is} ) = η_{ist} (1 − η_{ist} ). For nonbinary outcomes, one alternative is to use the conditional variance from the bivariate normal, with
As with the standard GEE1 discussed in Section 3.2.3, the two alternative approaches discussed in this section require an iterative algorithm. That is, both of these GEE1 methods require iterating between estimating β (given the current estimate of α ) as the solution to (3.2), and estimating α (given the current estimate of β ) as the solution to (3.11) or (3.13), until convergence.
In the previous section, a number of alternative proposals for estimating α , and thus V_{i} ( β , α ), were described. All of these approaches broadly fall within the framework of GEE1. That is, all of the approaches discussed so far use estimating equations for β given by (3.2), and differ only in terms of how α , and hence V_{i} ( β , α ), is estimated. In this section we discuss a secondorder extension of generalized estimating equations, hereafter referred to as GEE2 (Zhao and Prentice, 1990; Liang, Zeger, and Qaqish, 1992). To do so, we need to introduce some additional notation. Let
Π _{i} = E( Y _{i} X_{i} , β , α ) and, in an ever so slight departure from previous notation, V_{i} ≈ Cov( Y _{i} X_{i} ). Then, the GEE2 estimating equations for θ = ( β ′, α ′)′ are given by
where D_{i} = ∂ Π _{i} ( θ )/∂θ . Note that, unlike GEE1, D_{i} (and V_{i} ) in GEE2 is not blockdiagonal and thus the estimating equations for β depend on ${\mathit{Y}}_{i}={({\mathit{Y}}_{i}^{\prime},{\mathit{Z}}_{i}^{\prime})}^{\prime}$
, and not simply on Y _{ i }. As a consequence, the GEE2 estimating equations for β given by (3.15) are quite different from the GEE1 estimating equations for β given by (3.2). In particular, for the GEE2 estimator of β to be consistent, the model for all elements of Y _{i} must be correctly specified; that is, the first two moments of Y _{ i } must be correctly specified. GEE2 can yield substantial bias in estimators of both β and α (Fitzmaurice, Lipsitz, and Molenberghs, 2001) if assumptions about second moments are misspecified. This is in contrast to GEE1 where a consistent estimator of β is obtained provided only that the mean of Y _{ i } has been correctly specified.The main appeal of GEE2 is that it is almost fully efficient for both the marginal regression parameters, β , and the withinsubject associations, α . Note that the working covariance matrix V_{i} is usually obtained by making additional assumptions about the third and fourth moments of Y i. Some alternative specifications for the third and fourth moments are discussed in Zhao and Prentice (1990) and Liang, Zeger, and Qaqish (1992). Thus, a notable feature of GEE2 is that it makes additional assumptions about higherorder moments. Furthermore, specification of these higherorder moments can greatly increase the computational complexity and make implementation of the method somewhat more difficult; it is notable that none of the major statistical software packages currently have options for GEE2.
Finally, we note that some authors refer to the solution to (3.2) and (3.13) as GEE2; in contrast, we prefer to reserve the term GEE2 to refer to estimators that require the first two moments of Y _{ i } to be correctly specified in order to yield consistent estimators of β and α . That is, in contrast to GEE1 estimators, GEE2 estimators of β are not robust to misspecification of the second moments.
Although most longitudinal studies are designed to collect complete data on all participants, missing data very commonly arise and must be properly accounted for in the analysis; otherwise, biased estimators of longitudinal change can result. When longitudinal data are missing, the data set is necessarily unbalanced over time because not all individuals have the same number of repeated measurements at a common set of occasions. This feature of missingness creates no difficulties for the standard GEE method as it can handle the unbalanced data without having to discard data on individuals with any missing data. That is, when some individuals’ response vectors are only partially observed, the standard GEE approach circumvents the problem of missing data by simply basing inferences on the observed responses. However, the validity of this method of analysis will require that certain assumptions about the reasons for any missingness, often referred to as the missingdata mechanism, are tenable.
The missingdata mechanism can be thought of as a model that describes the probability that a response is observed or missing at any occasion. Here, we briefly review and distinguish two general types of missingdata mechanisms; see Chapter 17 for a more detailed description. The two missingdata mechanisms are referred to as missing completely at random (MCAR) and missing at random (MAR) (Rubin, 1976; Laird, 1988). These two mechanisms differ in terms of assumptions concerning whether or not missingness is related to responses that have been observed. The distinction between these two mechanisms determines the appropriateness of standard GEE methods. Specifically, the standard GEE method yields consistent marginal regression parameter estimators provided the responses are MCAR (see, for example, Laird, 1988).
Data are said to be MCAR when the probability that responses are missing is unrelated to either the specific values that, in principle, should have been obtained (the missing responses) or the set of observed responses. Thus, longitudinal data are MCAR when missingness in Y _{ i } is simply the result of a chance mechanism that does not depend on either observed or unobserved components of Y _{ i }. To better understand this missingdata process, consider the simple case where there are only two measurement occasions with the outcome fully observed on all subjects at the first occasion, and missing on some subjects at the second occasion. Because the outcome can be missing at the second occasion only, we can define the single indicator random variable R _{ i2}, which equals 1 if Y _{ i2} is observed and 0 if Y _{ i2} is unobserved. The missing data are said to be MCAR if
that is, the probability that Y _{ i2} is missing does not depend on the observed value of Y _{ i1} or the possibly missing value of Y _{ i2}. We note that the use of the term MCAR is sometimes restricted to the case where
this distinction only becomes important when an analysis is based on a subset of the covariates in X_{i} that excludes a covariate that is predictive of R _{ i2}. The essential feature of MCAR is that the observed data can by thought of as a random sample of the complete data. As a result, all of the moments of the observed data do not differ from the corresponding moments of the complete data. This property provides the validity for the standard GEE method that bases inferences on the observed responses.
In contrast to MCAR, data are said to be MAR when the probability that responses are missing depends on the set of observed responses, but is unrelated to the specific missing values that, in principle, should have been obtained. For the simple bivariate response example considered previously, the missing data are said to be MAR if
that is, the probability that Y _{ i2} is missing depends on Y _{ i1} (and X _{ i }), but is conditionally independent of the possibly missing value of Y _{ i2}. Because the missingdata mechanism now depends upon observed responses, the sample moments based on the available data are biased estimates of the corresponding moments in the target population. Consequently, when data are MAR, the standard GEE method that bases inferences on the observed responses can yield biased regression parameter estimates. Finally, if Pr(R _{ i2} = 1Y _{ i1}, Y _{ i2},X _{ i }) depends in any way on the possibly missing outcome Y _{ i2}, the missing data are said to be not missing at random (NMAR) or oftentimes referred to as being “nonignorably” missing. The case of NMAR is beyond the scope of this chapter; see Chapters 20 and 22 for a more indepth discussion of this topic.
When missing data are assumed to be MAR, there are two general approaches for handling this problem within the GEE framework: multiple imputation (e.g., Paik, 1997) and weighted estimating equations (Robins, Rotnitzky, and Zhao, 1995). The idea behind multiple imputation is very simple: substitute or fill in the values that were not recorded with imputed values. However, to reflect the uncertainty inherent in the imputation of the unobserved responses, the imputation process is repeated multiple times. The imputations are typically based on some assumed model for the missing data given the observed data. The attractive feature of imputation methods is that, once a filledin data set has been constructed, the standard GEE method for complete data can be applied. The multiple filledin data sets produce different sets of parameter estimates and their standard errors that are then appropriately combined to provide a single estimate of the parameters of interest, together with standard errors that reflect the uncertainty inherent in the imputation of the unobserved responses. There is an extensive literature on multiple imputation, and the reader can find an excellent review of this topic in Chapter 21. We do not discuss multiple imputation for GEE any further because, to date, there is no overwhelming concensus on the best way to impute discrete longitudinal data; see Chapter 21 for more details.
The second general approach for handling data that are MAR is via weighted estimating equations. In one of the simplest versions of the weighted estimating equations approach, an individual’s contribution to the standard GEE is weighted inversely by the probability of being observed at the given times. The key idea behind weighting methods is that the underrepresentation of certain response profiles in the observed data is taken into account and corrected. The weighted estimating equations approach is best suited to the case of monotone missingdata patterns as might arise when missingness is due to attrition or dropout. A variety of different weighting methods that adjust for dropout have been proposed. These approaches are often called propensity weighted or inverse probability weighted methods; see Chapter 20 for an excellent discussion of these methods. In all of these methods the underlying idea is to base estimation on the observed responses but weight them to account for the probability of remaining in the study.
Inverse probability weighted methods were first proposed in the sample survey literature, where the weights are known and based on the survey design (e.g., Horvitz and Thompson, 1952). In the weighted estimating equations approach, however, the weights are not known but must be estimated based on an assumed model for dropout. The propensities for dropout can be estimated as a function of the observed responses prior to dropout, and also as a function of the covariates and any extraneous variables that are thought likely to predict dropout.
Consider the simple example introduced earlier where there are only two measurement occasions and any missingness is restricted to the second occasion. In this simple case, the missingdata pattern is monotone. Furthermore, let us assume the data are MAR as in (3.16) and let
a function of possibly unknown parameters γ. The basic idea underlying the weighted GEE method is to weight each individual’s contribution to the standard GEE by the inverse probability of being observed, thereby accounting for those subjects with the same history of responses and covariates (y _{ i1},X _{ i }), but who were missing Y _{ i2}. Specifically, the GEE in (3.2) is weighted by the inverse probabilities to yield a simple “weighted GEE,”
Note that when R _{ i2} = 1, then Y _{ i } = (Y _{ i1}, Y _{ i2}) and, similarly, μ _{i} = (μ _{ i1}, μ _{ i2}); but when R _{ i2} = 0, then Y _{ i } = Y _{ i1} and, similarly, μ _{i} = μ _{ i1}.
The intuition behind this approach is that it reweights the observed data to mimic what would likely be seen in a data set without missing data, thereby producing unbiased estimating equations and consistent estimators for β . In particular, under MAR as in (3.16),
or, equivalently,
Therefore, the estimating equations given by (3.17) are unbiased for 0 at the true β ,
As the estimating equations are unbiased for 0, using results from method of moments, the solution $\hat{\mathit{\beta}}$
to (3.17) defines a consistent estimator for β .Unlike in the survey sampling setting where π_{i} is known, here π_{i} is unknown and will need to be replaced in (3.17) with an estimate; this estimate can be obtained using, for example, a logistic regression model with “outcome” R _{ i2} and “predictors” (y _{ i1},X _{ i }). In general, the validity of most weighting methods requires that the model for the missingness probabilities, π _{ i }, has been correctly specified. We note that the weighted GEE given above is a very simple special case of a general class of weighted estimators. In particular, Robins, Rotnitzky, and Zhao (1995) discuss more general weighted estimating equations for longitudinal data and the construction of “semiparametric efficient” weighted estimating equations; see Chapter 20 for an excellent summary of these developments. Finally, Robins (2000) and Robins and Rotnitzky (2001) have also recently developed socalled “doubly robust” weighted estimators that relax the assumption that the model for the missingness probabilities, π _{ i }, has been correctly specified, albeit requiring additional assumptions on the model for Y _{ i } given X _{ i }. Doubly robust methods require the specification of two models, one for the missingness probabilities and another for the distribution of the complete data. For doubly robust estimation, the weighted GEE is augmented by a function of the response. When this augmentation term is selected and modeled correctly according to the distribution of the complete data, the estimator of β is consistent even if the model for missingness is misspecified. On the other hand, if the model for missingness is correctly specified, the augmentation term does not need to be correctly specified to yield consistent estimators of β (Scharfstein, Rotnitzky, and Robins, 1999, see Section 3.2 of Rejoinder; Robins and Rotnitzky, 2001; van der Laan and Robins, 2003; Bang and Robins, 2005; see also Lipsitz, Ibrahim, and Zhao, 1999; Lunceford and Davidian, 2004). Thus, the appealing property of doubly robust methods is that they yield estimators that are consistent when either, but not necessarily both, the model for the missingness mechanism or the model for the distribution of the complete data has been correctly specified. These estimators are doubly robust in the sense of providing double protection against model misspecification. See Chapter 23 for a more detailed discussion of doubly robust estimators.
As noted previously, the weighted estimating equations approach is best suited to the case of monotone missingdata patterns. However, in many longitudinal studies an individual’s response can be missing at one followup time, and be measured at the next followup time, resulting in a large class of distinct missingness patterns. This is often referred to as “intermittent” missingness or nonmonotone missingness. In that setting, approximate methods have been proposed for handling missing data that are assumed to be MAR but not MCAR. For example, via simulations and asymptotic studies, Lipsitz et al. (2000) and Fitzmaurice, Lipsitz, and Molenberghs (2001) show that the GEE1 using the second set of “Gaussian” estimating equations described in Section 3.3.1 yields estimators of β with relatively small bias in many settings. However, this method does require that the withinsubject association, usually considered a nuisance characteristic of the data, is correctly specified. The method does not, however, require estimation of the missingness probabilities, π _{ i }.
As is the case for any generalized linear model, model checking is an important aspect of the fitting of marginal models to longitudinal data. However, because GEE methods are not likelihoodbased, many of the standard goodnessoffit statistics and model diagnostics are not immediately available. In this section we very briefly review some recent literature on this topic; further research on model diagnostics is needed.
Recall that in linear regression with independent univariate outcome data, Cook’s distance and socalled “DFBETAs” are widely used measures of influence (see, for example, Belsley, Kuh, and Welsch, 1980; Cook and Weisberg, 1982). Specifically, DFBETAs assess how each coefficient is changed by including the observation from the ith subject, thereby measuring the influence or effect that the ith subject has on the estimate of the regression parameters β . It is calculated by deleting the ith observation, and recomputing the ordinary leastsquares estimate of β . Because ordinary least squares is noniterative, fast computer algorithms have been developed to recompute the estimate of β for each subject. Pregibon (1981) extended DFBETAs to logistic regression using onestep estimators of β . Preisser and Qaqish (1996) extended DFBETAs in a similar way within the GEE approach. Their onestep DFBETAs for GEE require deletion of the n_{i} repeated measures for the ith subject. That is, conditional on the estimate α , a “onestep” estimate of β , say ${\hat{\mathit{\beta}}}_{i}$
, is obtained by excluding the n_{i} repeated measures for the ith subject and performing one step of the Fisher scoring algorithm described in Equation (3.7) with the fulldata estimate $\hat{\mathit{\beta}}$ as the starting value,To obtain a unitless, composite measure of the influence of each subject on the entire set of regression coefficients, a statistic similar to Wilks’s (1963) statistic for multivariate outliers,
can be used. A large value of w_{i} (relative to the other w_{i} s) indicates that the ith subject may have unusually high influence. Using Wilks’s (1963) criterion as a very rough guide, {(N − p − 1)(N − 1)/(Np)}w _{ i } has an approximate F distribution with p (the dimension of β ) and N − p − 1 degrees of freedom (for example, see Lipsitz, Laird, and Harrington, 1992).
Goodnessoffit criteria for generalized estimating equations have also been developed. For example, Pan (2001) proposed an extension of Akaike’s information criterion (AIC) to GEE. For the special case of longitudinal binary data, Barnhart and Williamson (1998) and Horton et al. (1999) proposed goodnessoffit tests for GEE that are extensions of the Hosmer–Lemeshow goodnessoffit test for logistic regression (Hosmer and Lemeshow, 1980). Here, we briefly review goodnessoffit statistics for GEE. Suppose it is of interest to determine whether or not the marginal model
provides an adequate fit to the data. A standard approach is to fit a broader model (e.g., a model with interactions and/or polynomial and higherorder terms) and test whether the additional terms are significantly different from zero. Alternatively, a “global goodnessoffit” statistic can be obtained by extending the Hosmer–Lemeshow statistic (Hosmer and Lemeshow, 1980; see also Tsiatis, 1980). Following the suggestion of Hosmer and Lemeshow for ordinary logistic regression, G (usually 10) groups can be formed based on combinations of the covariates X _{ij} in the logistic regression model. A test for goodness of fit is constructed by testing whether the additional regression coefficients for the G − 1 indicator variables differ from zero. Following Hosmer and Lemeshow, Horton et al. (1999) suggest forming groups based on deciles of the predicted probabilities from the given model,
Note that each subject has n_{i} separate estimates of risk $\left({\tilde{\mu}}_{ij}\text{s}\right)$
and that there are ${\mathrm{\Sigma}}_{i=1}^{N}\text{}{n}_{i}$ observations in total. Horton et al. (1999) suggest forming 10 groups of approximately equal size from the deciles of these predicted probabilities. For example, the first group contains the ${\mathrm{\Sigma}}_{i=1}^{N}\text{}{n}_{i}/10\text{}({Y}_{ij},{\mathit{X}}_{ij})\text{s}$ with the smallest values of ${\tilde{\mu}}_{ij}$ , and the last group contains the ${\mathrm{\Sigma}}_{i=1}^{N}\text{}{n}_{i}/10\text{}({Y}_{ij},{\mathit{X}}_{ij})\text{s}$ with the largest values of ${\tilde{\mu}}_{ij}$ . Finally, defining the G − 1 group indicatorswhere the groups are based on “percentiles of risk,” a test for goodness of fit is obtained by considering the alternative model
Specifically, the score statistic given in (3.10) can be used to test the null hypothesis
Finally, graphical displays are very useful techniques for conveying information about the most salient features of longitudinal data. They can provide insights about patterns of change in the mean response over time (e.g., linearity or the lack thereof) and the choice of suitable functional forms for covariates. Graphical techniques, especially those based on residuals, are especially useful for assessing the adequacy of any postulated model for longitudinal data. They are also useful for identifying observations and individuals that are potential outliers. With appropriate transformations, residual diagnostics developed for standard linear regression can be extended to the longitudinal setting. However, an acknowledged difficulty with conventional residual diagnostics is that they are somewhat subjective in nature. What appears to be a random scatter to one individual might be considered evidence of systematic trend to another. That is, it can be very difficult to discern whether an apparent trend in a scatter plot of the residual reflects some aspect of model misspecification or is simply a reflection of natural variation. McCullagh and Nelder (1989, pp. 392–393) aptly summarize this problem when they state that “the practical problem is that any finite set of residuals can be made to yield some kind of pattern if we look hard enough, so that we have to guard against overinterpretation.”
Recently, Lin, Wei, and Ying (2002) developed modelchecking techniques based on “cumulative sums” and “moving sums” of residuals that help discern the “signal” from the “noise.” The basic idea is to aggregate the residuals over certain coordinates. The coordinates typically used for these sums of residuals are the individual covariates (e.g., X_{ijk} , the kth covariate) and the fitted values, $h\left({\mathit{X}}_{ij}^{\prime}\hat{\mathit{\beta}}\right)$
. These authors demonstrated that a key advantage of working with sums of residuals, rather than crude residuals, is that a reference distribution is available to ascertain their natural variation. That is, the observed sums of the residuals can be compared, both graphically and numerically, to a reference distribution under the assumption of a correctly specified marginal model for the mean. This allows for the determination of whether any apparent pattern is evidence of a systematic trend or simply due to natural variation. This modelchecking technique developed by Lin, Wei, and Ying (2002) removes a large degree of subjectivity from the assessment of graphical displays of residuals and places residual diagnostics on a more objective footing.Specifically, if the assumed model for the mean response is correct, then the cumulative sums of residuals are centered at zero. Moreover, the distribution of the cumulative sum can be approximated by that of a Gaussian process with zero mean whose realizations can be generated via computer simulation. It is relatively straightforward to generate realizations from the distribution of the cumulative sum, under the assumption that the model for the mean is correct (the details are omitted here and the interested reader is referred to Lin, Wei, and Ying, 2002). Thus, to assess whether any apparent trend in the observed cumulative sum of residuals reflects systematic trends rather than chance fluctuations, a number of realizations from the appropriate Gaussian process can be superimposed. To the extent that the curves generated from the null distribution tend to be closer to and intersect zero more often than the observed curve, this provides evidence of lack of fit. This assessment can be put on a more formal footing by comparing the maximum absolute value of the observed cumulative sum to a large number of realizations (say 10,000) from the null distribution.
An appealing feature of the graphical and numerical methods based on cumulative and moving sums of residuals is that they are valid regardless of the true joint distribution of the longitudinal response vector; in particular, they do not require correct specification of the covariance among the responses. As such, these graphical and numerical techniques for assessing the model for the mean response are relatively robust to assumptions about the distribution of the responses and assumptions about the covariance among the repeated measures.
In this section we present results of analyses of cardiovascular abnormalities from a longitudinal study of children infected with HIV1 to illustrate some of the main ideas highlighted in earlier sections. Results from various crosssectional and shortterm longitudinal studies (Lipshultz et al., 1998) have suggested that children infected with HIV1 might have higher risks of cardiovascular abnormalities. We consider this hypothesis using data from the Pediatric Pulmonary and Cardiac Complications (P2C2) of Vertically Transmitted HIV Infection Study (Lipshultz et al., 1998). This was a large, prospective longitudinal study designed to monitor heart disease and the progression of cardiac abnormalities in children born to HIVinfected women. In the P2C2 study, a birth cohort of 401 infants born to women infected with HIV1 were scheduled to have their cardiovascular function measured approximately every year from birth to age 6, producing up to seven repeated measurements of cardiovascular function on each child. Of the 401 infants who participated in this study, 74 (18.8%) were HIV positive, and 319 (81.2%) were HIV negative.
Heart Pumping Ability at Age ^{a} 


Subject 
HIV ^{ b } 
Mom Smoked ^{ c } 
Gest. Age (wks) 
Low Birth Weight ^{ d } 
Birth 
1 
2 
3 
4 
5 
6 
1 
1 
0 
41 
0 
0 
0 
0 
0 
0 
0 
. 
2 
1 
1 
34 
0 
1 
. 
0 
0 
1 
. 
. 
3 
0 
1 
40 
0 
1 
0 
0 
. 
. 
. 
. 
4 
1 
0 
40 
0 
0 
. 
0 
0 
0 
1 
. 
5 
0 
1 
39 
0 
. 
1 
0 
. 
. 
. 
. 
6 
0 
1 
35 
0 
1 
. 
. 
. 
. 
. 
. 
7 
0 
0 
36 
0 
. 
0 
0 
. 
. 
. 
. 
7 
1 
0 
33 
1 
. 
1 
1 
1 
. 
. 
. 
8 
0 
0 
36 
1 
0 
0 
. 
. 
. 
. 
. 
9 
0 
0 
41 
1 
. 
. 
. 
. 
0 
. 
. 
10 
0 
1 
34 
1 
. 
0 
0 
. 
0 
1 
0 
( . = missing)
Notes:
a 1 = abnormal, 0 = normal.
b 1 = HIV positive, 0 = not HIV positive.
c 1 = mother smoked during pregnancy, 0 mother did not smoke.
d 1 = low birth weight for age, 0 = normal birth weight.
The question of main scientific interest is to determine if children infected with HIV1 have worse heart function over time. In particular, we are interested in modeling the pumping ability of the heart (left ventricular fractional shortening) over time. The outcome variable for this analysis is a binary response denoting abnormally low left ventricular fractional shortening (1 = low fractional shortening, 0 = normal) at each occasion. For these data, we are interested in modeling the marginal means or, equivalently, the probabilities of abnormal heart function over time, and relating changes in these marginal probabilities to covariates. The main covariate is the indicator of HIV infection; it is of interest to estimate the effect of HIV infection on heart function over time. Previous results (Lipshultz et al., 1998, 2000, 2002) from the P2C2 study have shown that subclinical cardiac abnormalities develop early in children born with HIV, and that they are frequent, persistent, and often progressive.
Thus, for these analyses we are interested in assessing whether children with HIV have progressively worse heart function over time compared to nonHIV children. In the regression model for abnormal heart function, this translates into a test of the timebyHIV status interaction. For these analyses, potential confounding variables that must be adjusted for include mother’s smoking status during pregnancy (coded 1 = yes, 0 = no), gestational age (in weeks), and birth weight standardized for age (coded 1 = abnormal, 0 = normal). Data from 10 randomly selected children are displayed in Table 3.1.
We note that there is a substantial amount of missing data. Although each child was scheduled to have an echocardiogram every year for the first 6 years of life, including at birth, only 1 (0.25%) of the 401 study participants had outcomes measured at all seven occasions; see Table 3.2 for the frequency distribution of the number of repeated echocardiograms. Table 3.3 shows the number of subjects with echocardiogram measurements at each of the seven measurement occasions. From Table 3.2, we see that only 91 subjects (22.7%) were seen on more than three occasions. From Table 3.3, we see that 276 of the 401 children (68.8%) have baseline measurements; after birth, the percentage of children with echocardiogram measurements declines until only 7 (1.7%) of the 401 subjects have measurements at 6 years of age. For illustrative purposes, the analyses presented here assume that the missing responses are MCAR (Rubin, 1976; Laird, 1988). As discussed in Section 3.4, when the outcome data are MCAR, all GEE approaches yield consistent estimators of the regression parameters provided the model for the mean is correctly specified. However, we caution the reader that a substantive analysis of these data would require a far more careful treatment of the missing data (see the chapters in Part 5 for detailed discussions of this topic).
Number of Echocardiograms 
Number of Subjects 
Percentage 

1 
148 
36.91 
2 
104 
25.94 
3 
58 
14.46 
4 
50 
12.47 
5 
30 
7.48 
6 
10 
2.49 
7 
1 
0.25 
Total 
401 
100.00 
To examine the effect of HIV1, we considered the following marginal logistic regression model for the probability of abnormal heart function at time t_{j} , denoted μ_{ij} :
for j = 1, ..., 7, where t_{j} = j − 1, HIV _{i} equals 1 if the ith child is born with HIV1 and equals 0 otherwise; smoke _{i} equals 1 if the mother smoked during pregnancy and 0 otherwise; age _{i} is the gestational age (in weeks); and wt _{i} equals 1 if the child’s birth weight for gestational age was abnormal and 0 otherwise. Because there are so few children with echocardiogram measurements after age 4, there are insufficient data to fit an unstructured correlation matrix with (7 × 6)/2 = 21 parameters,
Instead, to account for the withinsubject association among the binary responses, we consider two possible “working correlation” patterns: “exchangeable,” in which ρ_{ist} = α for all
where 0 < α < 1.
Finally, to illustrate similarities and differences between various GEE estimation techniques, we compare the estimates of ( β ,α) obtained using six approaches: (1) GEE under “working independence” (equivalent to ordinary logistic regression); (2) standard GEE1 with estimation of α based on unconditional residuals; (3) GEE1 with estimation of α based on conditional residuals; (4) GEE1 with Gaussian estimation of α; (5) GEE2 with V_{i} in (3.15) specified using the Bahadur distribution (Bahadur, 1961); and (6) maximum likelihood using the parametric Bahadur distribution, which is based on two and higherorder correlations.
Age at Visit (Years) 
Number of Subjects 
Percentage 

Birth 
276 
68.83 
1 
267 
66.58 
2 
154 
38.40 
3 
123 
30.67 
4 
83 
20.70 
5 
37 
9.23 
6 
7 
1.75 
Because the Bahadur distribution is used in both the GEE2 and ML approaches, we provide a brief description here. We define the standardized binary variable S_{ij} as
The pairwise correlation between Y_{ij} and Y_{ik} is ρ_{jk} = E(S_{ij}S_{ik} ), and the Rthorder correlation between the first R responses is defined as ρ _{12...R } = E(S _{ i1} S _{ i2} ...S _{ iR }). The Rthorder correlation between any R of the n repeated binary responses is defined similarly. The Bahadur representation of the 2 ^{n} − 1 multinomial probabilities corresponding to the joint distribution of (Y _{ i1}, Y _{ i2},...,Y _{ in }) is
For both GEE2 and ML approaches, we assumed all fifth and higherorder correlations are 0 (ρ_{jklmn} = ··· = ρ _{1...n } = 0). In addition, we assumed that all fourthorder correlations are the same, regardless of the sets of times (ρ_{jklm} = ρ _{ j′k′l′m } for all jklm ≠ j′k′l′m′), and that all thirdorder correlations are the same, regardless of the sets of times (ρ_{jkl} ≠ ρ _{ j′k′l } for all jkl ≠ j′k′l′). The two models for the pairwise correlations ρ _{ st } (exchangeable and autoregressive) are the same for all six estimation methods.
The logistic regression parameter estimates and standard errors are presented in Table 3.4 for the “working independence” GEE and GEE1 with exchangeable correlation estimated using unconditional residuals. Table 3.4 presents both modelbased and empirical (or socalled sandwich) variance estimates. As might be expected, the most discernible differences between the modelbased and empirical standard errors occur for the “working independence” GEE. This is because the naive assumption of independence among repeated binary responses obtained from the same child is “furthest” from the true underlying withinsubject correlation. In general, when the true correlation among repeated measures is high, the differences between the modelbased and empirical standard errors for the “working independence” GEE can be substantial; the former provide unrealistic estimates of the sampling variability. For the data from the P2C2 study, the estimated correlation is relatively modest (assuming an exchangeable correlation, the estimated correlation is 0.15); consequently, the differences between the modelbased and empirical standard errors are not too extreme. For example, for the “working independence” GEE, the largest differences are in the estimated variance of the timebyHIV infection interaction, with the empirical variance being almost 1.5 (or [0.16/0.13]^{2}) times larger than the modelbased variance, and in the gestational age effect where the empirical variance is almost 1.4 (or [0.36/0.31]^{2}) times larger than the modelbased variance. We note that for the “working independence” GEE the modelbased variance is not necessarily always smaller than the empirical variance; their relative size depends on whether the parameter of interest is the effect of a timevarying (or withinsubject) or timestationary (or betweensubject) covariate, the magnitude of the correlation, and the proportion of missing data over time (see Mountford et al., 2007, for a more detailed discussion). We also note that for the GEE1 with exchangeable correlation estimated using unconditional residuals, there are fewer differences between the modelbased and empirical standard errors. However, the small number of discernible differences (e.g., the estimated variances of the timebyHIV interaction) suggest that the model for the correlation could potentially be improved. That is, if the working model for the correlation has been correctly specified, in general, we would expect the modelbased and empirical standard errors to be relatively similar.
Effect 
Method 
Estimate 
Model SE 
SE 
Empirical Wald Z 
pValue 

Intercept 
IND 
2.284 
1.236 
1.508 
1.51 
0.130 
GEE1 
1.763 
1.416 
1.468 
1.20 
0.230 

Time 
IND 
−0.600 
0.076 
0.097 
−6.16 
<0.001 
GEE1 
−0.630 
0.078 
0.096 
−6.54 
<0.001 

HIV 
IND 
−0.069 
0.252 
0.266 
−0.26 
0.796 
GEE1 
−0.081 
0.255 
0.266 
−0.30 
0.761 

Time*HIV 
IND 
0.204 
0.130 
0.159 
1.28 
0.200 
GEE1 
0.283 
0.125 
0.149 
1.90 
0.058 

Mom Smoke 
IND 
−0.196 
0.153 
0.175 
−1.12 
0.263 
GEE1 
−0.219 
0.175 
0.168 
−1.31 
0.191 

Gest. Age 
IND 
−0.058 
0.031 
0.038 
−1.52 
0.130 
GEE1 
−0.044 
0.036 
0.037 
−1.19 
0.233 

Low Birth Wt. 
IND 
0.048 
0.163 
0.198 
0.24 
0.810 
GEE1 
0.096 
0.186 
0.186 
0.52 
0.604 
Based on the regression parameter estimates and empirical standard errors from the GEE1 with exchangeable correlation, there is the suggestion that the pattern of change in risk of abnormal heart function differs by HIV1 infection group (Z = 1.90, p < 0.06). Specifically, after adjustment for maternal smoking during pregnancy, gestational age, and birth weight, the risk decreases at a slower rate in the HIV1 infected children. Note that the interpretation of the results is very similar to the interpretation of results from an ordinary logistic regression model for crosssectional data. For example, children with HIV1 infection have $\text{exp}({\hat{\beta}}_{2}+{\hat{\beta}}_{3}{t}_{j})=\text{exp}(0.081+0.285{t}_{j})$
times the odds of having an abnormal pumping ability compared to children without HIV1 infection at time t_{j} . Specifically, at birth (t _{1} = 0), the ratio of their odds of having an abnormal pumping ability is approximately 1 (exp(−0.081) = 0.92), while at year 6 the ratio of their odds is approximately 5.0 (exp(−0.081 + 6 × 0.285) = 5.1).For illustrative purposes, Table 3.5 presents the parameter estimates (and empirical standard error estimates) obtained from all five methods for a “working exchangeable” correlation. Note that the regression parameter estimates and estimated standard errors from the three GEE1 methods are very similar to each other. In contrast, the regression parameter estimates from GEE2 are somewhat different. Recall that GEE2 requires the stronger assumption that both the first and second moments must be correctly specified. For these data, the empirical standard errors for GEE2 estimates are similar to those for GEE1; however, this result is not to be expected in general. Finally, as expected, ML yields the smallest estimated variances for the estimated regression parameters. The largest gain in efficiency is for the timebyHIV infection interaction, the effect of greatest subjectmatter interest. The estimated relative efficiency for GEE1 with exchangeable correlation versus ML is (0.12/0.15)^{2} = 66%. However, a potential drawback of ML is that the full joint distribution of the binary outcomes must be correctly specified; thus, the assumptions made about all fifth and higherorder correlations being zero must be correct for ML to yield asymptotically unbiased estimators of the regression parameters.
Effect 
Method 
Estimate 
SE 
Wald Z 
pValue 

Intercept 
GEE1 (uncond) 
1.763 
1.468 
1.20 
0.230 
GEE1 (cond) 
1.851 
1.470 
1.26 
0.208 

GEE1 (Gaussian) 
1.822 
1.469 
1.24 
0.216 

GEE2 
2.126 
1.443 
1.47 
0.141 

ML 
1.733 
1.336 
1.30 
0.195 

Time 
GEE1 (uncond) 
−0.630 
0.096 
−6.54 
<0.001 
GEE1 (cond) 
−0.626 
0.096 
−6.49 
<0.001 

GEE1 (Gaussian) 
−0.627 
0.096 
−6.51 
<0.001 

GEE2 
−0.583 
0.095 
−6.16 
<0.001 

ML 
−0.678 
0.084 
−8.10 
<0.001 

HIV 
GEE1 (uncond) 
−0.081 
0.266 
−0.30 
0.761 
GEE1 (cond) 
−0.079 
0.265 
−0.30 
0.765 

GEE1 (Gaussian) 
−0.080 
0.265 
−0.30 
0.764 

GEE2 
−0.047 
0.274 
−0.17 
0.865 

ML 
−0.037 
0.255 
−0.15 
0.884 

Time*HIV 
GEE1 (uncond) 
0.283 
0.149 
1.90 
0.058 
GEE1 (cond) 
0.271 
0.151 
1.80 
0.072 

GEE1 (Gaussian) 
0.275 
0.150 
1.83 
0.068 

GEE2 
0.264 
0.156 
1.70 
0.089 

ML 
0.287 
0.122 
2.35 
0.019 

Mom Smoke 
GEE1 (uncond) 
−0.219 
0.168 
−1.31 
0.191 
GEE1 (cond) 
−0.215 
0.168 
−1.28 
0.201 

GEE1 (Gaussian) 
−0.216 
0.168 
−1.29 
0.198 

GEE2 
−0.249 
0.166 
−1.50 
0.134 

ML 
−0.241 
0.159 
−1.51 
0.131 

Gest. Age 
GEE1 (uncond) 
−0.044 
0.037 
−1.19 
0.233 
GEE1 (cond) 
−0.047 
0.037 
−1.25 
0.211 

GEE1 (Gaussian) 
−0.046 
0.037 
−1.23 
0.219 

GEE2 
−0.055 
0.036 
−1.52 
0.128 

ML 
−0.043 
0.034 
−1.27 
0.204 

Low Birth Wt. 
GEE1 (uncond) 
0.096 
0.186 
0.52 
0.604 
GEE1 (cond) 
0.089 
0.187 
0.48 
0.632 

GEE1 (Gaussian) 
0.092 
0.186 
0.49 
0.622 

GEE2 
0.042 
0.180 
0.23 
0.816 

ML 
0.155 
0.168 
0.93 
0.356 

ρ 
GEE1 (uncond) 
0.153 
0.120 
1.28 
0.201 
GEE1 (cond) 
0.194 
0.078 
2.47 
0.013 

GEE1 (Gaussian) 
0.165 
0.049 
3.37 
0.001 

GEE2 
0.174 
0.059 
2.95 
0.003 

ML 
0.151 
0.034 
4.42 
<0.001 
Although all GEE1 methods produce similar standard errors for the estimated regression parameters, this is not the case for the estimated correlation parameter. Specifically, the GEE1 based on conditional residuals greatly reduces the estimated variance of the correlation compared to GEE1 with unconditional residuals, while GEE1 based on Gaussian estimation reduces the estimated variance even further. Note that GEE1 based on Gaussian estimation yields a smaller variance estimate than GEE2, although this result cannot be expected in general.
Effect 
Method 
Estimate 
SE 
Wald Z 
pValue 

Intercept 
GEE1 (uncond) 
2.018 
1.506 
1.34 
0.180 
GEE1 (cond) 
1.817 
1.508 
1.20 
0.228 

GEE1 (Gaussian) 
1.978 
1.506 
1.31 
0.190 

GEE2 
2.118 
1.429 
1.48 
0.138 

ML 
1.763 
1.352 
1.30 
0.193 

Time 
GEE1 (uncond) 
−0.634 
0.097 
−6.50 
<0.001 
GEE1 (cond) 
−0.661 
0.098 
−6.72 
<0.001 

GEE1 (Gaussian) 
−0.639 
0.098 
−6.55 
<0.001 

GEE2 
−0.519 
0.095 
−5.48 
<0.001 

ML 
−0.637 
0.088 
−7.28 
<0.001 

HIV 
GEE1 (uncond) 
−0.072 
0.264 
−0.27 
0.785 
GEE1 (cond) 
−0.075 
0.264 
−0.28 
0.777 

GEE1 (Gaussian) 
−0.073 
0.264 
−0.28 
0.783 

GEE2 
−0.073 
0.299 
−0.24 
0.808 

ML 
−0.037 
0.269 
−0.14 
0.891 

Time*HIV 
GEE1 (uncond) 
0.242 
0.157 
1.54 
0.123 
GEE1 (cond) 
0.273 
0.155 
1.76 
0.078 

GEE1 (Gaussian) 
0.248 
0.156 
1.58 
0.114 

GEE2 
0.173 
0.168 
1.03 
0.302 

ML 
0.213 
0.140 
1.53 
0.128 

Mom Smoke 
GEE1 (uncond) 
−0.199 
0.173 
−1.15 
0.250 
GEE1 (cond) 
−0.200 
0.172 
−1.16 
0.246 

GEE1 (Gaussian) 
−0.199 
0.173 
−1.15 
0.250 

GEE2 
−0.266 
0.174 
−1.53 
0.127 

ML 
−0.206 
0.172 
−1.20 
0.231 

Gest. Age 
GEE1 (uncond) 
−0.050 
0.038 
−1.31 
0.190 
GEE1 (cond) 
−0.044 
0.038 
−1.15 
0.249 

GEE1 (Gaussian) 
−0.049 
0.038 
−1.28 
0.202 

GEE2 
−0.056 
0.036 
−1.55 
0.122 

ML 
−0.043 
0.034 
−1.26 
0.207 

Low Birth Wt. 
GEE1 (uncond) 
0.076 
0.194 
0.39 
0.694 
GEE1 (cond) 
0.100 
0.192 
0.52 
0.603 

GEE1 (Gaussian) 
0.081 
0.193 
0.42 
0.677 

GEE2 
0.040 
0.188 
0.21 
0.830 

ML 
0.096 
0.173 
0.55 
0.581 

ρ 
GEE1 (uncond) 
0.167 
0.083 
2.02 
0.043 
GEE1 (cond) 
0.289 
0.063 
4.63 
<0.001 

GEE1 (Gaussian) 
0.192 
0.059 
3.23 
0.001 

GEE2 
0.167 
0.076 
2.21 
0.027 

ML 
0.206 
0.050 
4.08 
<0.001 
In Table 3.6, we consider estimates based on the five methods presented in Table 3.5 when the “working correlation” is assumed to be firstorder autoregressive (AR(1)). In general, the pattern of results in Table 3.6 is similar to that in Table 3.5. With a modest correlation, it is perhaps not surprising that the estimates of the regression parameters under AR(1) or exchangeable should be so similar. The imbalance in the data due to missingness probably accounts for some of the small differences when the results in Table 3.5 and Table 3.6 are both the “exchangeable” and “serial” odds ratio models as linear models for the log odds ratio, with
for the “serial” odds ratio model, and
for the “exchangeable” odds ratio model.
To estimate β and α, under an “exchangeable” or “serial” odds ratio pattern, we can use the estimating equations proposed by Lipsitz, Laird, and Harrington (1990) based on unconditional residuals or the estimating equations proposed by Carey, Zeger, and Diggle (1993) based on conditional residuals; in general, the latter yield more efficient estimators of α . The results of fitting the two models for the odds ratio pattern using both methods of estimation are presented in Table 3.7. The pattern of results in Table 3.7 is similar to those in Table 3.5 and Table 3.6, confirming that GEE1 methods are relatively insensitive to choice of models and methods of estimation for the withinsubject association. Interestingly, the potential benefits of GEE1 based on conditional residuals for estimation of the withinsubject association are highlighted in Table 3.7 where the standard error for the estimated log odds ratio is discernibly smaller than for GEE1 based on unconditional residuals. For the parameter of main interest, the timebyHIV status interaction, the estimates of effects are similar to those found in Table 3.5 and Table 3.6.
Effect 
Method 
Odds Ratio Model 
Estimate 
SE 
Wald Z 
pValue 

Intercept 
GEE1 (uncond) 
exc 
1.870 
1.491 
1.25 
0.210 
GEE1 (cond) 
exc 
1.797 
1.492 
1.20 
0.228 

GEE1 (uncond) 
serial 
2.003 
1.503 
1.33 
0.183 

GEE1 (cond) 
serial 
1.810 
1.506 
1.20 
0.230 

Time 
GEE1 (uncond) 
exc 
−0.618 
0.096 
−6.46 
<0.001 
GEE1 (cond) 
exc 
−0.623 
0.096 
−6.51 
<0.001 

GEE1 (uncond) 
serial 
−0.618 
0.097 
−6.41 
<0.001 

GEE1 (cond) 
serial 
−0.637 
0.097 
−6.58 
<0.001 

HIV 
GEE1 (uncond) 
exc 
−0.075 
0.265 
−0.28 
0.777 
GEE1 (cond) 
exc 
−0.078 
0.265 
−0.30 
0.768 

GEE1 (uncond) 
serial 
−0.061 
0.263 
−0.23 
0.816 

GEE1 (cond) 
serial 
−0.060 
0.263 
−0.23 
0.819 

Time*HIV 
GEE1 (uncond) 
exc 
0.257 
0.150 
1.72 
0.086 
GEE1 (cond) 
exc 
0.269 
0.149 
1.80 
0.071 

GEE1 (uncond) 
serial 
0.232 
0.154 
1.51 
0.132 

GEE1 (cond) 
serial 
0.256 
0.152 
1.68 
0.093 

Mom Smoke 
GEE1 (uncond) 
exc 
−0.202 
0.169 
−1.20 
0.232 
GEE1 (cond) 
exc 
−0.203 
0.168 
−1.21 
0.228 

GEE1 (uncond) 
serial 
−0.194 
0.172 
−1.13 
0.258 

GEE1 (cond) 
serial 
−0.194 
0.171 
−1.13 
0.257 

Gest. Age 
GEE1 (uncond) 
exc 
−0.047 
0.038 
−1.24 
0.214 
GEE1 (cond) 
exc 
−0.045 
0.038 
−1.19 
0.234 

GEE1 (uncond) 
serial 
−0.050 
0.038 
−1.31 
0.189 

GEE1 (cond) 
serial 
−0.044 
0.038 
−1.16 
0.244 

Low Birth Wt. 
GEE1 (uncond) 
exc 
0.101 
0.188 
0.54 
0.589 
GEE1 (cond) 
exc 
0.110 
0.187 
0.59 
0.555 

GEE1 (uncond) 
serial 
0.087 
0.192 
0.45 
0.652 

GEE1 (cond) 
serial 
0.111 
0.190 
0.58 
0.559 

Log(OR) 
GEE1 (uncond) 
exc 
0.859 
0.931 
0.92 
0.356 
GEE1 (cond) 
exc 
1.051 
0.281 
3.74 
<0.001 

GEE1 (uncond) 
serial 
0.788 
0.999 
0.79 
0.430 

GEE1 (cond) 
serial 
1.369 
0.301 
4.54 
<0.001 
As discussed in the previous sections, Liang and Zeger (1986) developed generalized estimating equations to estimate the parameters of marginal models; the latter class of models represent a particular extension of generalized linear models to longitudinal data. Liang and Zeger (1986) introduced a class of momentbased estimating equations that yield consistent estimators of the regression parameters, and their variances, under relatively mild conditions. Over the past 20 years, the GEE approach has been the dominant method for estimation of marginal models for longitudinal data; so much so, indeed, that some authors confusingly refer to marginal models as “GEE models.” However, it is worth emphasizing that GEE methods are but one approach for estimating marginal model parameters. Moreover, GEE methods should not be regarded as conjoined at the hip to marginal models; the estimating equation approach can be applied equally to alternative classes of models for longitudinal data, such as socalled transitional or responseconditional models (e.g., Markov models for longitudinal data) and GLMMs.
The GEE method is semiparametric, in that the estimating equations are derived without fully specifying the joint distribution of the vector of repeated measures. This is a very appealing feature of the GEE approach, especially for the analysis of discrete longitudinal data, because for the latter case the total number of parameters in the saturated model for the joint distribution of the vector of responses grows exponentially with the number of repeated measures. In particular, GEE methods only require specification of the form of the first two moments of the outcome vector, and provide consistent estimators of the marginal regression parameters under the weak condition that only the first moment is correctly specified. That is, provided the marginal model for the mean response (the first moment) is correctly specified, the estimating equations yield estimators of the marginal regression parameters that are consistent and asymptotically normal, regardless of whether the second moments have been correctly specified.
Similar to the quasilikelihood estimating equations originally proposed by Wedderburn (1974), the optimal choice of weights is to take V_{i} = Cov( Y i). Naturally, there are certain tradeoffs associated with the use of less than optimal weights. However, in general, if the correlation among repeated measures is not too high and/or the effects of primary interest are betweensubject effects, then the efficiency of GEE estimators even under the naive assumption of independence is remarkably high (Liang and Zeger, 1986). However, when there is interest in the effects of nonstochastic timevarying covariates and/or the repeated measures are highly correlated, then suboptimal choices of weights (i.e., misspecification of the covariance) can result in a discernible loss of efficiency. Similarly, with inherently unbalanced longitudinal data (e.g., widely varying n_{i} ), suboptimal choices of weights can also result in loss of efficiency. In general, more efficient estimators of the marginal regression parameters can be obtained by specifying and estimating the covariance among the repeated measures; however, the potential gains in efficiency will depend in a subtle way on both the magnitude of the correlation and the covariate design.
As mentioned earlier, the major impetus for the GEE approach was that it provided a convenient alternative to maximum likelihood estimation of marginal models. For the case of continuous outcomes, likelihoodbased methods are widely used for the analysis of longitudinal data, for example, general linear models and linear mixedeffects models. In general, likelihoodbased estimation of marginal models for discrete longitudinal data has proven to be very challenging, in large part because there is no simple unified joint likelihood for discrete longitudinal data. Unlike the multivariate normal distribution, which is completely specified by the first two moments of the outcome vector, for discrete longitudinal data complete specification of the joint distributions requires specifying third and higherorder moments. Although various likelihood approaches have been proposed — for example, models based on second and higherorder correlations (Bahadur, 1961; Zhao and Prentice, 1990) and models based on second and higherorder odds ratios (McCullagh and Nelder, 1989; Lipsitz, Laird, and Harrington, 1990; Liang, Zeger, and Qaqish, 1992; Becker and Balagtas, 1993; Fitzmaurice, Laird, and Rotnitzky, 1993; Molenberghs and Lesaffre, 1994; Lang and Agresti, 1994; Glonek and McCullagh, 1995; Bergsma and Rudas, 2002) — none of these likelihoodbased models have proven to be of real practical use except in relatively limited settings. As the number of repeated measures increases, the number of parameters that need to be specified and estimated proliferates rapidly for any of these joint distributions, and a solution to the likelihood equations quickly becomes intractable. Thus, in general, full likelihood approaches require the specification of too many nuisance parameters, are complicated algebraically, and ML estimation can be computationally prohibitive. Furthermore, to obtain unbiased estimators of the marginal regression parameters, the full joint distribution of the data must, in general, be correctly specified. In contrast, with GEE methods, only the first moment needs to be correctly specified in order to obtain unbiased estimators; moreover, the GEE approach is not computationally demanding.
We note that Heagerty (1999) and Heagerty and Zeger (2000) have recently developed a likelihoodbased approach that combines the versatility of GLMMs for modeling the withinsubject association with a marginal regression model for the mean response. They refer to these models as marginalized randomeffects models; they are closely related to other approaches that formulate conditional models subject to marginal specification (e.g., Fitzmaurice and Laird, 1993; Azzalini, 1994; Heagerty, 2002; see also Wang and Louis, 2003). Recall that in the standard GLMM, the marginal means obtained by integrating over the random effects, in general, no longer follow a generalized linear model, due to the nonlinearity of the link function typically adopted in regression models for discrete responses. In contrast, the marginalized randomeffects model is specifically formulated such that the marginal mean follows a generalized linear model. Estimation for these marginalized randomeffects models is as computationally demanding as for GLMMs, and ML estimation is, so far, limited to relatively lowdimensional randomeffects distributions. However, these models appear to have the potential to overcome many of the limitations of previously proposed likelihoodbased methods and to provide a likelihoodbased alternative to GEE; further research is needed.
Finally, although an appealing feature of the GEE approach is its robustness to misspecification of the withinsubject association, there are settings where it can be appealing to model the covariance. To date, the implementations of GEE in standard statistical software packages provide only very limited options for modeling the covariance. In particular, there are few choices of models for the “working covariance” when the data are highly unbalanced and irregularly spaced in time. This is in contrast to models for continuous responses (e.g., general linear models and linear mixedeffects models), where there are a broad class of models for the covariance. Future work is needed in both the formulation and implementation of flexible models for the working covariance in GEE methods.
This work was supported by grants AI 60373, GM 29745, and MH 54693 from the U.S. National Institutes of Health.