DeclareDesign Community

Identical, Repeating Standard Errors for All Levels of a Factor


I am testing the DeclareDesign package and found a strange issue where the regression estimates 60 coefficients, but only produces 2 unique standard errors (which repeat over and over).

I am using the lm_robust function to compute the estimated means with clustered SEs for a panel dataset where counties are classified as either Democrat or Republican.

To ensure that this is not a strange idiosyncrasy of the data, I have produced a reproducible example below.

Any insight into this would be greatly appreciated!

Many thanks,

Please see code for reproducible example:
#-------- libraries used below -----
library(fabricatr)  # fabricate(), add_level(), cross_levels()
library(estimatr)   # lm_robust()
library(dplyr)      # mutate()
library(magrittr)   # %<>%
library(stringr)    # str_remove()
library(tidyr)      # separate()

#-------- Fabricate a panel with counties -----
# For each county,
#   (a) we have a county-level fixed effect: county_FE
#   (b) we have a classification as Democrat / Republican
# For each year,
#   (c) we have a year-level random shock: year_shock
# For each county-year,
#    (d) we have a measure of infrastructure spending:
#         infrastructure.spend = county_FE + year_shock


panel <- fabricate(
  county = add_level(N = 30, county_FE = runif(N, 1, 10)),
  years = add_level(N = 30, year_shock = runif(N, 1, 10), nest = FALSE),
  obs = cross_levels(by = join(county, years),
                     infrastructure.spend_it = county_FE + year_shock,
                     party = ifelse((as.numeric(county) %% 3) == 0,
                                    "Democrat",
                                    "Republican")) # split into 1/3 D and remaining 2/3 R
)

#-------- lm_robust: get means for party x year w. clustered SEs -----

#-------- create interaction term -----
panel %<>% mutate(year_by_party  = interaction(years, party))

#-------- run robust lm with clustering on county id --------------------
lm_robust.out <- lm_robust(data = panel,
                           formula = infrastructure.spend_it ~ -1 + year_by_party,
                           clusters = county,
                           se_type = "CR2",
                           ci = TRUE,
                           alpha = 0.05)

#-------- tidy output to get table of means --------------------
lm_robust.out %<>% tidy()
lm_robust.out %<>%
  mutate(term = str_remove(string = term, pattern = "year_by_party")) %>%
  separate(term, into = c("year", "party"), extra = "merge", fill = "left") %>%
  mutate(year = factor(year),
         party = factor(party))

#-------- Examine Standard Errors -----
table(lm_robust.out$std.error) #only two values - one for each party type

When you exclude the intercept and use a single factor, the X′X matrix is diagonal. Based on how the design is set up, only two values appear on that diagonal: the cell sizes of 10 (Democrat cells) and 20 (Republican cells).

> model.matrix(~ -1 + year_by_party, panel) %>% crossprod %>% diag %>% table
10 20 
30 30 

This will carry through to SEs.
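As a quick sketch of why equal diagonal entries force equal SEs (shown with the classical homoskedastic formula for simplicity, not the CR2 correction — the mechanism carries over):

```r
# Toy illustration: with a diagonal X'X, the SE of coefficient j is
# sigma * sqrt(1 / n_j), where n_j is the j-th diagonal entry (the cell
# size). Cells with the same count therefore get identical SEs.
X <- model.matrix(~ -1 + year_by_party, panel)
XtX_inv_diag <- 1 / diag(crossprod(X))   # (X'X)^{-1} is diagonal here
sigma_hat <- 1                           # any common sigma estimate
table(sigma_hat * sqrt(XtX_inv_diag))    # again only two distinct values
```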

Thank you! I see now how that mechanistically follows from the matrix algebra. I am not quite sure how to take this insight and use it to devise a solution, though.

Can I use lm_robust to estimate the annual means (and cluster-robust SEs for those estimated means) for Rep / Dem counties in the manner described in my original post?

At a first pass, it feels like I should be able to retrieve a Std.Error (cluster-robust or otherwise) for each estimated coefficient (which, in this case, corresponds to each Party-Year mean).

If you have any advice on profitable lines of inquiry to push forward, I would be grateful.

Many thanks,

You are getting that, it’s just that the SE is the same for each one because your design is nearly orthogonal. Var(β̂) = σ²(X′X)⁻¹, and X′X is diagonal with only two distinct values.

I think what you want is not a single point estimate of σ² robust to clustering, but different estimates, σ_ij, one per cell.

Because no terms are shared across groups, we can divide and conquer. Here’s a quick-and-dirty one-liner:

by(panel, panel$year_by_party, . %>% lm_robust(infrastructure.spend_it~1,.) %>% tidy)

However, this still has the same underlying issue: the error term in your design is highly structured.

Try taking a look at your response variable in wide format, and z-score it:

reshape2::acast(panel, county~years, value.var = "infrastructure.spend_it") %>% scale

You can compare the means / SDs with the equivalent model,

lm_robust(infrastructure.spend_it~years-1, panel) %>% summary

Maybe you meant to add some noise at the year/county interaction level?
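For instance (a hypothetical tweak to the original fabricate() call, with an arbitrary sd = 2), a unit-level shock at the county-year level would look like:

```r
panel_noisy <- fabricate(
  county = add_level(N = 30, county_FE = runif(N, 1, 10)),
  years = add_level(N = 30, year_shock = runif(N, 1, 10), nest = FALSE),
  obs = cross_levels(by = join(county, years),
                     # idiosyncratic county-year noise breaks the purely
                     # deterministic county_FE + year_shock structure
                     infrastructure.spend_it = county_FE + year_shock +
                                               rnorm(N, sd = 2),
                     party = ifelse((as.numeric(county) %% 3) == 0,
                                    "Democrat", "Republican"))
)
```

With that extra term, the within-cell variances differ across cells, so the cell-by-cell SEs should no longer collapse to just two values.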


Thank you for this insight. I think you are onto something, though I am still trying to confirm whether splitting the dataset adversely impacts the accuracy of the estimated SEs around the coefficients.

For my initial tests, it appears that:

by(panel, panel$year_by_party, . %>% lm_robust(infrastructure.spend_it~1,.) %>% tidy)

produces estimates similar (if not identical) to:

lm_robust(data = panel,
                        formula = infrastructure.spend_it ~ -1 + year_by_party,
                        clusters = county,
                        se_type = "CR2",
                        ci = TRUE,
                        alpha = 0.05)

So, perhaps, I need to re-examine the simulation after adding noise as you suggested. As I learn more, I’ll be sure to update my post here.