# Cluster-robust Logistic Regression

Hey team,

I have a binary dependent variable and would like to do an analysis with cluster-robust standard errors. Is this possible in the estimatr universe?

Hi Flourish â€“ thanks for your question!

You can of course use `lm_robust` regardless of the outcome space of the DV. For example in an experiment where the treatment is cluster-assigned, the following is kosher and youâ€™ll get an good estimate of the ATE and correct SEs:

``````library(fabricatr)
library(estimatr)
library(randomizr)

dat <- fabricate(
clusters = add_level(N = 50, cluster_shock = rnorm(N)),
individuals = add_level(N = 4, individual_shock = rnorm(N),
treatment = cluster_ra(clusters = clusters),
outcome = rbinom(N, 1, prob = pnorm(cluster_shock + individual_shock)))
)
lm_robust(outcome ~ treatment, clusters = clusters, data = dat)
``````

(only small caveat is that if the clusters are of heterogeneous sizes, a small bias can creep in that goes away as the number of clusters increases)

sometimes the above is called a linear probability model, though I prefer to think of it as just a good estimator of the ATE, which doesnâ€™t require a â€śprobability modelâ€ť per se.

But I think youâ€™re thinking of logistic regression â€“ estimatr only does linear estimators like OLS, IV, and difference-in-means. For generalized linear models like logit or probit, youâ€™ll have to exit our workflow and perhaps try the `sandwich` package, which includes the `vcovCL` function for `glm` objects.

``````library(sandwich)
fit <- glm(outcome ~ treatment, family = binomial, data = dat)
vcovCL(fit, cluster = dat\$clusters)
``````

This is perfect! Thanks so much. Is there any interest in extending lm_robust to something like glm_robust that includes models like logit / probit?

Not at the moment â€“ the computational procedures for OLS and GLM are fundamentally different, so I think it would require building out a whole new feature set. Iâ€™ve also heard (though I really really really donâ€™t know!) that the statistical theory for clustered SEs for glms has a shakier basis? Again, thatâ€™s basically just character assassination by rumor â€“ I really donâ€™t know!

Perhaps thereâ€™s room for a separate package that just wraps together `sandwich` and `broom` with `estimatr`-like syntax for clustered standard errors in a tidy setting. That way we donâ€™t take responsibility for any computations

You can also do this with the clubSandwich package:

``````library(clubSandwich)
fit <- glm(outcome ~ treatment, family = binomial, data = dat)
vcov_CR(fit, cluster = dat\$clusters, type = "CR2")
``````

The `type = "CR2"` option gives a small-sample adjustment to the standard errors, which should improve on what you get from the sandwich package.

1 Like

My understanding is a bit different from Alex. AFAIK, the statisticaly theory underpinning clustered SEs with GLMs is very well developed and they are commonly used together for analysis of longitudinal binary/count data. See Liang & Zeger (1986) or one of many books on generalized estimating equations.

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika , 73 (1), 13-22.

Not shakier per se, but they are MLEs and have the associated baggage (vs OLS can rely on Gauss Markov). GLMs can handle some heterosked. directly (eg in Poisson E(Y) = Var(Y) ) vs a quasi-poisson model is no longer MLE and a lot of classical results for SEs and tests break down. Pretty sure the sandwich package â€śdoes the right thingâ€ť.

But if you are working in GLMâ€™s, you are already making some strong distributional assumptions about your data, you may as well model the clusters explicitly in a multilevel/hierarchical fashion and you will probably get SEs similar sandwich and your model may fit the data a bit better for medium n.

Thank you James! I would substitute your understanding for mine any day. I really appreciate the cite (and also clubSandwich, which is fantastic.)