DeclareDesign Community

Cluster-robust Logistic Regression

Hey team,

I have a binary dependent variable and would like to do an analysis with cluster-robust standard errors. Is this possible in the estimatr universe?

Hi Flourish – thanks for your question!

You can of course use lm_robust regardless of the outcome space of the DV. For example, in an experiment where the treatment is cluster-assigned, the following is kosher and you’ll get a good estimate of the ATE and correct SEs:

library(fabricatr)
library(estimatr)
library(randomizr)

# Simulate 50 clusters of 4 individuals each, with a cluster-assigned treatment
# and a binary outcome
dat <- fabricate(
  clusters = add_level(N = 50, cluster_shock = rnorm(N)),
  individuals = add_level(N = 4, individual_shock = rnorm(N),
                          treatment = cluster_ra(clusters = clusters),
                          outcome = rbinom(N, 1, prob = pnorm(cluster_shock + individual_shock)))
)

# OLS estimate of the ATE with cluster-robust standard errors
lm_robust(outcome ~ treatment, clusters = clusters, data = dat)

(The only small caveat is that if the clusters are of heterogeneous sizes, a small bias can creep in, though it goes away as the number of clusters increases.)

Sometimes the above is called a linear probability model, though I prefer to think of it as just a good estimator of the ATE, which doesn’t require a “probability model” per se.

But I think you’re thinking of logistic regression – estimatr only does linear estimators like OLS, IV, and difference-in-means. For generalized linear models like logit or probit, you’ll have to exit our workflow and perhaps try the sandwich package, which includes the vcovCL function for glm objects.

library(sandwich)

# Fit the logit, then compute the cluster-robust variance-covariance matrix
fit <- glm(outcome ~ treatment, family = binomial, data = dat)
vcovCL(fit, cluster = dat$clusters)
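
If you want a coefficient table with clustered SEs and p-values rather than just the variance-covariance matrix, a common pattern (not part of estimatr – just the usual lmtest route) is to pass that matrix to coeftest:

library(lmtest)

# z tests for the glm coefficients using the cluster-robust vcov from sandwich
coeftest(fit, vcov. = vcovCL(fit, cluster = dat$clusters))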

This is perfect! Thanks so much. Is there any interest in extending lm_robust to something like glm_robust that includes models like logit / probit?

Not at the moment – the computational procedures for OLS and GLMs are fundamentally different, so I think it would require building out a whole new feature set. I’ve also heard (though I really don’t know!) that the statistical theory for clustered SEs for GLMs has a shakier basis – but that’s basically character assassination by rumor, so take it with a grain of salt.

Perhaps there’s room for a separate package that just wraps together sandwich and broom with estimatr-like syntax for clustered standard errors in a tidy setting. That way we don’t take responsibility for any computations :slight_smile:
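
Just to make the idea concrete, here’s a rough sketch of what such a wrapper could look like – the function name glm_robust_tidy is made up, and this is only an illustration of combining glm, sandwich, and broom, not a real package:

library(broom)
library(lmtest)
library(sandwich)

# Hypothetical wrapper: fit a glm, compute clustered SEs with sandwich,
# and return a tidy data frame of coefficients, SEs, and tests
glm_robust_tidy <- function(formula, data, cluster, family = binomial) {
  fit <- glm(formula, family = family, data = data)
  vc <- vcovCL(fit, cluster = cluster)
  tidy(coeftest(fit, vcov. = vc))
}

glm_robust_tidy(outcome ~ treatment, data = dat, cluster = dat$clusters)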

You can also do this with the clubSandwich package:

library(clubSandwich)
fit <- glm(outcome ~ treatment, family = binomial, data = dat)

# CR2 cluster-robust variance-covariance matrix
vcovCR(fit, cluster = dat$clusters, type = "CR2")

The type = "CR2" option gives a small-sample adjustment to the standard errors, which should improve on what you get from the sandwich package.
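If I remember the clubSandwich interface correctly, you can also get the coefficient tests directly (with Satterthwaite degrees of freedom), rather than just the vcov matrix:

# Coefficient tests with CR2 SEs and Satterthwaite small-sample degrees of freedom
coef_test(fit, vcov = "CR2", cluster = dat$clusters)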


My understanding is a bit different from Alex’s. AFAIK, the statistical theory underpinning clustered SEs with GLMs is very well developed, and the two are commonly used together for the analysis of longitudinal binary/count data. See Liang & Zeger (1986) or one of many books on generalized estimating equations.

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Not shakier per se, but GLMs are MLEs and carry the associated baggage (whereas OLS can rely on Gauss-Markov). GLMs can handle some heteroskedasticity directly (e.g. in a Poisson model, E(Y) = Var(Y)), but a quasi-Poisson model is no longer an MLE and a lot of classical results for SEs and tests break down. Pretty sure the sandwich package “does the right thing”.
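
To make the overdispersion point concrete, here’s a small simulated example (made-up count data, not the binary outcome from the thread) comparing plain Poisson, quasi-Poisson, and sandwich SEs:

set.seed(1)
# Overdispersed counts: Var(Y) > E(Y), so plain Poisson SEs are too small
counts <- data.frame(x = rnorm(200))
counts$y <- rnbinom(200, mu = exp(0.5 * counts$x), size = 1)

pois <- glm(y ~ x, family = poisson, data = counts)
quasi <- glm(y ~ x, family = quasipoisson, data = counts)

# Model-based SEs vs heteroskedasticity-robust (sandwich) SEs
sqrt(diag(vcov(pois)))
sqrt(diag(vcov(quasi)))
sqrt(diag(sandwich::sandwich(pois)))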

But if you are working with GLMs, you are already making some strong distributional assumptions about your data, so you may as well model the clusters explicitly in a multilevel/hierarchical fashion. You will probably get SEs similar to sandwich, and your model may fit the data a bit better for medium n.
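
For instance, with lme4 (just as an illustration of the multilevel route – not part of the estimatr/sandwich workflow), a cluster-level random intercept would look like:

library(lme4)

# Multilevel logit: random intercept for clusters instead of cluster-robust SEs
glmer(outcome ~ treatment + (1 | clusters), family = binomial, data = dat)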

Thank you James! I would substitute your understanding for mine any day. I really appreciate the cite (and also clubSandwich, which is fantastic).